feat: migrate website search and embeddings to crawler service by larryro · Pull Request #595 · tale-project/tale

larryro · 2026-02-27T16:11:37Z

Summary

Migrate website search indexing and embedding storage from Convex to the crawler service backed by PostgreSQL + pgvector
Replace custom fixed-size chunker with semantic-text-splitter MarkdownSplitter for structure-aware content splitting
Add search UI and chunk inspection to the website pages dialog
Make website registration non-blocking with background homepage crawl and scheduled metadata sync
Harden embedding config with required env var validation, improve error handling and website sync reliability

Changes

Crawler service: New routers (search, pages, index), services (chunking, embedding, indexing, search, database, pg_website_store), and config module
Database: Idempotent init scripts, new tale_search database with pgvector, drop legacy TimescaleDB
Convex platform: Remove website_page_embeddings/ module and website_pages workflow action; website operations now delegate to crawler via internal actions
Frontend: Search + chunk viewer in website pages dialog, async registration UX, improved error toasts

Test plan

Crawler service unit tests (test_chunking_service, test_embedding_service, test_indexing_service, test_search_service, test_database, test_*_router)
Register a new website and verify background crawl completes
Search website content and confirm relevant results
Inspect chunks in the pages dialog
Delete a website and confirm cascading cleanup in crawler

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added hybrid search combining full-text and vector similarity for improved content discovery.
- Enabled content indexing with automatic chunking for better searchability.
- Added page listing with pagination, filtering, and chunk viewing capabilities.
- Enhanced website status tracking with new states: idle and scanning.
Chores
- Migrated database backend from TimescaleDB to ParadeDB with pgvector support.
- Updated dependencies including asyncpg, tiktoken, and semantic-text-splitter.

greptile-apps · 2026-02-27T16:11:55Z

Too many files changed for review. (109 files found, 100 file limit)

…rvice Move search indexing and embedding storage from Convex to the crawler service backed by PostgreSQL + pgvector. Make all DB init scripts idempotent and run them on every container startup. Drop legacy TimescaleDB extension, ensure crawler upserts website_urls before inserting chunks, and add transactional cascading deletes when removing a website.

Add semantic search and chunk viewer to the website pages dialog. Make website registration non-blocking with background homepage crawl and scheduled metadata sync. Track crawled vs total page counts separately. Remove the standalone rescan action in favor of scheduler-driven scans. Improve reliability with exponential backoff retries, HNSW index auto-creation, scan cancellation, domain normalization, parameterized sort columns, and proper URL encoding.

…kdownSplitter Swap the hand-rolled fixed-size chunking logic for the semantic-text-splitter library's MarkdownSplitter, which splits at structural markdown boundaries (headers, code blocks, tables, paragraphs, sentences) while respecting the target chunk size. This removes ~120 lines of manual paragraph/sentence/hard-split code and the abbreviation-aware sentence splitter. Tests updated to validate markdown-aware behaviour (header boundaries, code blocks, tables) and relaxed implementation-detail assertions.

…_search

Make embedding dimensions required via env var with startup validation to prevent silent dimension mismatches. Improve error visibility across crawler (background task logging, HNSW warnings) and platform (error toasts, sync error tracking in metadata). Fix website delete ordering to deregister from crawler before removing DB record.

coderabbitai · 2026-02-27T16:20:03Z

📝 Walkthrough

Code Review Summary

Walkthrough

This pull request performs a comprehensive architectural migration of the website crawler and search infrastructure. The crawler service gains PostgreSQL-backed persistence (replacing SQLite), adds new services for content embedding, indexing, and hybrid search, and exposes additional REST API endpoints for indexing and search operations. The platform backend removes its internal website page embeddings model and websitePages table, replacing them with external API calls to the crawler service for search, page listing, and synchronization operations. The database configuration shifts from TimescaleDB to ParadeDB for enhanced full-text and vector search capabilities.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The diff introduces significant structural changes across multiple service boundaries: new asynchronous database integration and service orchestration in the crawler, replacement of a large embedded embedding/search subsystem in the platform, and refactoring of website synchronization and persistence patterns. The changes span heterogeneous components (database schema, multiple new services with interdependencies, API endpoint wiring, and removal of legacy functionality), necessitating careful verification of service initialization order, database consistency, external API contract alignment, and data flow correctness.

Possibly related PRs

perf(platform,crawler): optimize website scan with incremental sync and persistent cache #553: Adds SQLite-backed persistent store and scheduler to crawler service, modifying overlapping code paths in main.py, routers/websites, and scheduler with different persistence backend choice.
feat(operator,platform): refactor operator agent loop and improve platform reliability #499: Replaces internal websitePages/embedding logic with external crawler-based flow in services/platform/convex/websites/internal_actions.ts and related website helpers.
refactor(platform): consolidate crawler tools into unified web tool #369: Adds crawler "web" router endpoint for fetch-and-extract and updates platform agent/tool code to call external crawler service instead of internal embedding flows.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.75% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: migrate website search and embeddings to crawler service' accurately summarizes the main change: moving search indexing and embedding storage from Convex to a new crawler service backed by PostgreSQL + pgvector.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/migrate-search-to-crawler

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 37

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

services/platform/convex/websites/bulk_create_websites.ts (1)

8-33: 🧹 Nitpick | 🔵 Trivial

Reuse toWebsiteDomain to keep normalization logic single-sourced.

Bulk create still duplicates hostname extraction in Line 32. Calling the shared domain helper keeps behavior consistent across create paths and avoids future drift.

♻️ Proposed refactor

-import { ensureUrl } from './create_website';
+import { toWebsiteDomain } from './create_website';
@@
-      const normalized = new URL(ensureUrl(websiteData.domain)).hostname;
+      const normalized = toWebsiteDomain(websiteData.domain);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/websites/bulk_create_websites.ts` around lines 8 -
33, The loop in bulkCreateWebsites duplicates hostname normalization by calling
new URL(ensureUrl(...)).hostname; replace that with the shared helper
toWebsiteDomain to single-source normalization and avoid drift: inside
bulkCreateWebsites (function bulkCreateWebsites) call
toWebsiteDomain(websiteData.domain) instead of new URL(ensureUrl(...)).hostname
and remove the redundant ensureUrl usage so all create paths use the same
normalization logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compose.yml`:
- Around line 151-154: Replace the simple depends_on: - db with a health-based
dependency so the crawler waits until PostgreSQL is ready; add a Docker
healthcheck to the db service (or ensure it exists) and change the crawler's
depends_on entry to use condition: service_healthy (refer to the depends_on key
and the db service) so pg_store_manager initialization won't attempt connections
before the DB is accepting connections.

In `@services/crawler/app/config.py`:
- Around line 95-104: The get_embedding_dimensions method currently permits
non-positive integers; update get_embedding_dimensions (and its use of
self.embedding_dimensions and the EMBEDDING_DIMENSIONS env var) to validate that
the resolved value is an integer > 0: when reading from
self.embedding_dimensions accept only positive ints, when parsing
os.environ["EMBEDDING_DIMENSIONS"] convert with int inside a try/except and
raise a clear ValueError for non-integer input, and if the value is <= 0 raise
ValueError("EMBEDDING_DIMENSIONS must be a positive integer") so invalid values
fail fast at config boundary.

In `@services/crawler/app/routers/pages.py`:
- Around line 53-58: The LEFT JOIN that aggregates chunk counts currently groups
and joins only by url causing counts to bleed across domains; update the chunks
subquery (alias c) to SELECT domain, url, COUNT(*) AS chunks_count and GROUP BY
domain, url, and change the JOIN condition to match both c.url = wu.url AND
c.domain = wu.domain so chunk counts are domain-scoped when joining to the work
unit rows in pages.py.

In `@services/crawler/app/routers/websites.py`:
- Around line 53-55: The code currently does json.loads(sd) and will raise on
malformed JSON, aborting homepage ingestion; wrap the string parse in a
try/except (catch json.JSONDecodeError and ValueError) around the json.loads
call in the same block that checks isinstance(sd, str), and on failure set sd =
None (or structured_data = None if that name is used downstream) and log a
debug/warn message so execution proceeds to call extract_meta_description(sd)
with sd=None; reference the sd variable and the call to extract_meta_description
to locate where to change behavior.
- Around line 90-94: The code currently awaits asyncio.gather(_crawl_homepage(),
_discover_urls()) then unconditionally calls manager.update_scan_status(domain,
"active"); change the flow so failure in either background step results in a
non-active status: have _crawl_homepage and _discover_urls either return a
success boolean or re-raise on fatal error, call results = await
asyncio.gather(_crawl_homepage(), _discover_urls(), return_exceptions=True),
inspect results for exceptions or any False return values, and only call
manager.update_scan_status(domain, "active") (and update_last_scanned(domain))
when all tasks succeeded; otherwise call manager.update_scan_status(domain,
"failed") (or an appropriate error status) so failures aren’t masked.

In `@services/crawler/app/services/database.py`:
- Around line 22-23: Replace the implicit default weak DB password fallback with
a fail-fast check: if DATABASE_URL is provided return it directly; otherwise
require DB_PASSWORD to be set and raise a clear exception if it is missing
instead of defaulting to "tale_password_change_me". Update the code that reads
os.environ.get("DB_PASSWORD", ...) and the connection-string construction (the
variable password and the return
f"postgresql://tale:{password}@db:5432/tale_search") to validate presence and
raise an error with a helpful message so misconfiguration is detected at
startup.
- Around line 31-49: The pool created by asyncpg.create_pool is assigned to the
global _pool before post-creation checks, so any exception during initialization
(e.g., in settings.get_embedding_dimensions or the SELECT) can leave _pool open;
wrap the post-creation initialization and the embedding-dimension validation in
a try/except/finally (or try/except) so that on any exception you await
_pool.close() and set _pool = None before re-raising; apply this around the
block that uses _pool, including the stored_dims fetch and the configured_dims
comparison, referencing _get_database_url, asyncpg.create_pool, _pool,
settings.get_embedding_dimensions, and the stored_dims validation.

In `@services/crawler/app/services/embedding_service.py`:
- Around line 35-40: The code returns embeddings assuming response.data length
matches the input batch; add a cardinality guard after calling
self._client.embeddings.create to verify len(response.data) == len(batch) (where
response.data and batch are the unique symbols here) and if they differ raise a
clear exception (e.g., ValueError) or log and raise so downstream
chunk-to-embedding alignment cannot silently drift; keep the original return of
[item.embedding for item in response.data] only after the check passes.

In `@services/crawler/app/services/indexing_service.py`:
- Around line 66-72: The code builds the INSERT payload by indexing into
embeddings[i] without verifying embeddings length; before creating the list
comprehension used in the conn.executemany call, check that len(embeddings) ==
len(chunks) (or handle a mismatch explicitly), and if they differ, log or raise
a clear error and abort the insert to avoid an unhandled IndexError; update the
block around embeddings and chunks (the variables used in the INSERT preparing
logic) to validate cardinality and fail fast with a descriptive message or
handle the discrepancy according to the intended behavior (e.g., skip, pad, or
retry).

In `@services/crawler/app/services/pg_website_store.py`:
- Around line 134-135: The current conditional json.dumps(u["metadata"]) if
u.get("metadata") else None (and similarly for structured_data) treats empty
objects/arrays as falsy and converts them to NULL; change the condition to
preserve empty JSON by checking existence/non-null instead (e.g., use
json.dumps(u["metadata"]) if "metadata" in u or if u.get("metadata") is not
None, and the same for u["structured_data"]) so empty {} or [] are serialized
rather than dropped.

In `@services/crawler/app/services/scheduler.py`:
- Around line 214-226: The cancellation flag is being cleared too early by
_clear_cancelled(domain), so a cancellation requested just before scan start is
ignored; change the logic to check _is_cancelled(domain) before calling
_clear_cancelled(domain) (i.e., remove or move the early _clear_cancelled call
and only clear the queued cancellation after you've detected and handled the
cancellation or at the very end of the scan). Update the three occurrences
involving _clear_cancelled/_is_cancelled in scheduler.py (the initial block
around crawler_service.initialize and the later checkpoints referenced in the
review) so each checkpoint first calls _is_cancelled(domain) and handles/returns
accordingly, and only calls _clear_cancelled(domain) when acknowledging and
finishing cancellation or after completion.
- Around line 136-140: The HEAD-check AsyncClient is disabling TLS verification
(verify=False) which is unsafe; update the AsyncClient creation used for HEAD
checks (the async with httpx.AsyncClient(...) as client blocks referencing
_HEAD_TIMEOUT and client) to remove verify=False or set verify=True so TLS certs
are validated, and apply the same change to the other AsyncClient usage
mentioned (the second block around the other HEAD check). Ensure these changes
only affect the HEAD-check client creation in scheduler.py.
- Around line 294-296: The code currently calls json.loads(sd) unguarded which
can raise on malformed structured data and abort the scan; wrap the conversion
in a try/except catching json.JSONDecodeError (and a generic Exception
fallback), log or warn about the malformed payload, and set sd to a safe default
(e.g., {} or None) before calling extract_meta_description so the scan
continues; make this change around the existing isinstance(sd, str) branch that
handles sd and the subsequent call to extract_meta_description to ensure
malformed JSON doesn’t crash the scheduler.

In `@services/crawler/app/services/search_service.py`:
- Around line 76-102: The code serializes embeddings with json.dumps and relies
on casting in SQL; instead register asyncpg's pgvector/vector codec when
creating the connection pool (so pgvector values are sent/received natively) and
then modify _vector_search to pass the embedding list directly (remove vec_str =
json.dumps(embedding) and use embedding as the parameter) while keeping the same
SQL (with $1::vector cast if desired) so asyncpg will encode the Python list
into the pgvector type automatically; update pool creation/registration logic
(where self._pool is created) to call the asyncpg pgvector adapter/codec
registration so the driver knows how to handle vector values.

In `@services/crawler/pyproject.toml`:
- Around line 21-23: The dependency entries "asyncpg", "tiktoken", and
"semantic-text-splitter" use minimum version specifiers (>=) which creates
non-reproducible builds; update these entries in pyproject.toml to use pinned,
exact versions (==) consistent with the rest of the file—e.g., change
"asyncpg>=0.30.0", "tiktoken>=0.9.0", and "semantic-text-splitter>=0.18.0" to
explicit pinned versions (choose the tested/latest known-good patch versions) so
the project consistently uses exact pins across dependencies.

In `@services/crawler/tests/test_chunking_service.py`:
- Around line 141-150: The current assertion in
test_overlap_produces_shared_content is ineffective; change it to verify actual
shared content between adjacent chunks by checking for overlapping sentence
tokens rather than string concatenation length. Update the test to iterate
adjacent pairs from chunk_content(...) and assert that there exists at least one
sentence identifier (e.g., the substring "Sentence number X" or split by
sentences) that appears in both result[i].content and result[i+1].content;
reference the test function test_overlap_produces_shared_content and the
chunk_content call to locate the change.

In `@services/crawler/tests/test_config.py`:
- Around line 88-109: Add tests in TestGetEmbeddingDimensions that exercise
get_embedding_dimensions() with invalid env values: create one test that sets
EMBEDDING_DIMENSIONS (and another that sets CRAWLER_EMBEDDING_DIMENSIONS) to a
non-integer string like "abc" and asserts it raises ValueError (match
"EMBEDDING_DIMENSIONS"), and another test that sets each to non-positive values
("0" and "-1") and asserts ValueError; use the same patch.dict(os.environ, env,
clear=True) pattern and _base_env() setup so both prefixed
(CRAWLER_EMBEDDING_DIMENSIONS) and generic (EMBEDDING_DIMENSIONS) cases are
covered.

In `@services/crawler/tests/test_pages_router.py`:
- Around line 13-14: Replace uses of timezone.utc with the datetime.UTC alias
for the constants _DEFAULT_CRAWLED and _DEFAULT_DISCOVERED: change
tzinfo=timezone.utc to tzinfo=datetime.UTC in the datetime(...) calls, and
update imports so datetime.UTC is available (e.g., import datetime as datetime
or adjust from datetime import datetime and reference datetime.UTC accordingly).

In `@services/crawler/tests/test_search_service.py`:
- Around line 172-176: The test asserts a last-write-wins merge strategy but the
production code lacks documentation; add a concise comment or docstring inside
SearchService._merge_rrf stating that when multiple result lists contain the
same item id, later lists overwrite earlier item metadata (i.e.,
last-write-wins), so future maintainers understand this intentional behavior and
the contract relied upon by tests like test_later_list_overwrites_item_metadata.

In `@services/db/docker-entrypoint-wrapper.sh`:
- Around line 91-97: The schema/init step is run in the background causing a
readiness race; change the wrapper so that after pg_isready confirms Postgres is
accepting connections it calls run_init_scripts synchronously (not in a
background subshell) and only returns/exits once run_init_scripts completes,
ensuring schema/index creation finishes before the container signals readiness;
locate the block using pg_isready and run_init_scripts in
docker-entrypoint-wrapper.sh and remove the trailing '&' and surrounding
background subshell so the init runs in the foreground.
- Around line 86-87: The init script is swallowing failures because the pipeline
`psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f "$script" 2>&1 | grep -E
"^(ERROR|NOTICE)" || true` forces a zero exit even when migrations fail; change
the logic so that `psql` errors cause a non-zero exit: run `psql -f "$script"`
capturing stdout/stderr, detect lines that start with "ERROR" (or simply rely on
psql's exit code) for the given "$script" and if any error is found exit with a
non-zero status and log the details (don’t use `|| true`), ensuring failures
during processing of "$script" bubble up and stop container startup.

In `@services/db/init-scripts/01-init-extensions.sql`:
- Around line 21-24: The current GRANT statements only affect existing objects
in schema "tale" (GRANT ALL PRIVILEGES ON SCHEMA tale; ON ALL TABLES IN SCHEMA
tale; ON ALL SEQUENCES IN SCHEMA tale TO CURRENT_USER), so add corresponding
default-privilege statements to ensure tables and sequences created later
inherit the same rights: run ALTER DEFAULT PRIVILEGES IN SCHEMA tale ... GRANT
ALL PRIVILEGES ON TABLES TO CURRENT_USER and ALTER DEFAULT PRIVILEGES IN SCHEMA
tale ... GRANT ALL PRIVILEGES ON SEQUENCES TO CURRENT_USER (use the same schema
"tale" and principal CURRENT_USER to match the existing grants).

In `@services/db/init-scripts/02-create-convex-database.sql`:
- Line 12: The script currently uses "DROP EXTENSION IF EXISTS timescaledb
CASCADE;" which can unintentionally drop dependent objects and data; update the
statement that drops the TimescaleDB extension (the DROP EXTENSION for
timescaledb) to remove the CASCADE keyword (i.e., use a non-cascading drop or
skip dropping entirely), so the startup script either uses "DROP EXTENSION IF
EXISTS timescaledb" without CASCADE or avoids dropping the extension at runtime
to preserve dependent objects and data.

In `@services/db/init-scripts/03-create-rag-database.sql`:
- Line 12: The startup script currently executes a destructive DROP EXTENSION IF
EXISTS timescaledb CASCADE which can remove dependent objects; remove that DROP
statement (the line containing "DROP EXTENSION IF EXISTS timescaledb CASCADE;")
from the init path and replace it with a non-destructive approach: ensure the
extension is created with a safe idempotent call (e.g., use a
create-if-not-exists flow) or move any DROP logic into an explicit migration
script that includes pre-checks, backups and operator approval; if you need to
perform version changes, implement them as separate migrations that check for
dependent objects and require manual approval rather than dropping the extension
during startup.

In `@services/platform/app/features/websites/components/websites-table.tsx`:
- Around line 37-44: The background sync in the useEffect block (key
`websites-sync-${organizationId}` using sessionStorage) calls syncStatuses({
organizationId }) fire-and-forget; add minimal error handling by awaiting or
attaching a .catch to syncStatuses and logging failures (via existing logger or
console.error) and optionally schedule a single silent retry/backoff (e.g.,
setTimeout) to avoid leaving stale status data; keep the 5-minute cooldown logic
intact and ensure errors do not throw to UI by swallowing them after logging.

In `@services/platform/convex/agent_tools/web/helpers/query_web_context.ts`:
- Around line 58-65: Wrap the single fetch call to the crawler (the block using
getCrawlerServiceUrl(), controller.signal, query and limit) in a short inline
retry loop: attempt the POST up to a few times (e.g., 3 attempts) with a short
awaitable delay (~1s) between attempts, catching transient errors (network
failures, timeouts, and HTTP 5xx/429 responses) and only rethrowing after
exhausting retries; preserve and pass controller.signal to each fetch attempt,
and do not use ctx.scheduler.runAfter for retries. Ensure the retry logic is
colocated around the existing fetch invocation so synchronous agent-tool
execution still returns a result or a final error.

In `@services/platform/convex/agent_tools/web/helpers/search_pages.ts`:
- Around line 38-46: The outbound fetch to `${crawlerUrl}/api/v1/search` (in
services/platform/convex/agent_tools/web/helpers/search_pages.ts) needs a
timeout and short inline retries: wrap the existing fetch call in an
AbortController with a reasonable timeout (e.g., 2–5s) and implement a small
retry loop (e.g., 2–3 attempts) that retries on network errors, timeouts, or 5xx
responses (do not retry on 4xx), with a short exponential/backoff delay between
attempts; include attempt count and last error/status in the final thrown Error
if all retries fail, and keep using the same payload (args.query, DEFAULT_LIMIT)
when re-sending.
- Around line 35-42: The code calls the crawler even when args.query is
empty/whitespace; before calling getCrawlerServiceUrl()/fetch, validate
args.query by trimming and checking length, and fail fast with a clear
user-facing error/response (e.g., return an error object or throw an Error with
a guidance message like "Please provide a non-empty search query") from the
search_pages helper so no network I/O occurs; place this check immediately after
the debugLog and before constructing crawlerUrl/performing fetch, referencing
args.query, debugLog, getCrawlerServiceUrl, and DEFAULT_LIMIT to locate the spot
to change.
- Around line 38-42: The POST to `${crawlerUrl}/api/v1/search` currently sends
only query and limit (using args.query and DEFAULT_LIMIT) which can leak results
across tenants; update the request payload in the fetch call so it includes the
tenant/site/domain scope (e.g., add tenantId or domain/siteId fields sourced
from the current request context or args) so the crawler can filter by
tenant—modify the JSON.stringify body passed to fetch in this helper (the call
that uses crawlerUrl, args.query, DEFAULT_LIMIT) to include the appropriate
tenant/domain property used by the crawler API.

In `@services/platform/convex/websites/actions.ts`:
- Around line 132-143: The code registers the crawler against website.domain
even when args.domain has changed, leaving the crawler tied to the old domain;
update the logic in the block around registerDomainWithCrawler and the
subsequent patchWebsite so that when args.domain differs from website.domain you
(1) register the new domain with the crawler using
registerDomainWithCrawler(args.domain, args.scanInterval) (or if scanInterval
unchanged, still use args.scanInterval), and (2) deregister the old domain by
calling the corresponding unregister function (e.g.,
unregisterDomainWithCrawler(website.domain)) before or after registration;
ensure you still handle scanInterval changes by using args.scanInterval when
calling registerDomainWithCrawler and keep the
ctx.runMutation(internal.websites.internal_mutations.patchWebsite, {...}) call
to persist the new domain and scanInterval.
- Around line 44-53: The code computes toWebsiteDomain but still passes the raw
args.domain into the provisioning mutation, causing inconsistent persisted
domains; update the payload sent to
internal.websites.internal_mutations.provisionWebsite (the call that assigns
websiteId) to use the normalized toWebsiteDomain value for the domain field (and
any other places in this provisioning block that rely on the domain) so the
stored domain matches the normalized value used for crawler registration.
- Around line 205-210: After retrieving the website with
ctx.runQuery(internal.websites.internal_queries.getWebsite, { websiteId:
args.websiteId }) in the fetchPages, fetchChunks, and searchContent actions, add
a defensive authorization check that verifies the caller (ctx.auth?.userId) is a
member of the website's organization (website.organizationId) before returning
any data; perform a membership lookup via your existing org/membership query
(e.g., call an appropriate internal.organizations/internal_memberships query)
and throw an Unauthorized error if no membership is found so only org members
can access that website's pages/chunks/search.

In `@services/platform/convex/websites/create_website.ts`:
- Around line 18-25: ensureUrl currently only matches lowercase
'http://'/'https://' so inputs with mixed-case schemes (e.g. 'HTTPS://...') get
double-prefixed; update ensureUrl to perform a case-insensitive check (e.g. test
s against /^https?:\/\//i or compare s.toLowerCase().startsWith(...)) and return
the original string when a scheme exists, otherwise prepend 'https://'. Keep
toWebsiteDomain as-is (it should call the fixed ensureUrl) so new
URL(ensureUrl(input)).hostname yields the correct stored domain.

In `@services/platform/convex/websites/internal_actions.ts`:
- Around line 231-237: Validate and clamp the offset and limit before using them
to call the crawler: in the handler where offset and limit are set (variables
offset and limit in the async handler that calls getCrawlerUrl()), enforce
offset >= 0 and limit within a reasonable max (e.g., 1..1000) or throw a
validation error; apply the same fix to the other handler referenced around the
second occurrence (the handler near lines 286-291). Ensure you normalize
incoming args (args.offset, args.limit) and replace downstream uses so only the
validated/clamped values are proxied to the crawler APIs.
- Around line 49-50: The switch that handles scan interval values currently uses
a silent default of "return 21600", which masks invalid inputs; update the
default branch (the switch handling scan intervals where "default: return
21600") to validate unknown values explicitly by either throwing a descriptive
Error (e.g., "Invalid scan interval: <value>") or returning a clearly handled
sentinel (null/undefined) and logging the invalid input so callers must handle
it; ensure the change surfaces invalid inputs instead of silently scheduling a
6-hour interval.
- Around line 130-175: The loop over websites in internal_actions.ts runs
fetchWebsiteInfo and ctx.runMutation serially causing timeouts; change it to
process eligible websites concurrently with a bounded concurrency (worker pool)
to avoid unbounded parallelism: filter websites by
lastStatusSyncAt/SYNC_INTERVAL_MS using now, then schedule tasks that call
fetchWebsiteInfo(website.domain) and on success or error call
ctx.runMutation(internal.websites.internal_mutations.patchWebsite, ...) for that
website; implement a simple concurrency limiter (e.g., queue of promises with a
maxWorkers constant) to run at most N fetch+mutation tasks in parallel and await
completion of all tasks before finishing the action.

In `@services/platform/convex/websites/internal_mutations.ts`:
- Around line 48-54: The pageCount and crawledPageCount fields currently allow
negatives; update their validation so they must be non-negative before calling
WebsitesHelpers.updateWebsite. In the mutation schema (where pageCount and
crawledPageCount are defined) change their validators to enforce >=0 (e.g.,
replace v.number() with a non-negative number validator or add .min(0) / integer
check), or add an explicit pre-check in the handler to throw or clamp negatives
before invoking WebsitesHelpers.updateWebsite; ensure the checks reference the
same field names pageCount and crawledPageCount so invalid values are
rejected/normalized prior to the update.

---

Outside diff comments:
In `@services/platform/convex/websites/bulk_create_websites.ts`:
- Around line 8-33: The loop in bulkCreateWebsites duplicates hostname
normalization by calling new URL(ensureUrl(...)).hostname; replace that with the
shared helper toWebsiteDomain to single-source normalization and avoid drift:
inside bulkCreateWebsites (function bulkCreateWebsites) call
toWebsiteDomain(websiteData.domain) instead of new URL(ensureUrl(...)).hostname
and remove the redundant ensureUrl usage so all create paths use the same
normalization logic.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae9e49 and a319433.

⛔ Files ignored due to path filters (2)

services/crawler/uv.lock is excluded by !**/*.lock
services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**

📒 Files selected for processing (107)

compose.yml
services/crawler/app/config.py
services/crawler/app/main.py
services/crawler/app/models.py
services/crawler/app/routers/__init__.py
services/crawler/app/routers/crawler.py
services/crawler/app/routers/index.py
services/crawler/app/routers/pages.py
services/crawler/app/routers/search.py
services/crawler/app/routers/websites.py
services/crawler/app/services/chunking_service.py
services/crawler/app/services/database.py
services/crawler/app/services/embedding_service.py
services/crawler/app/services/indexing_service.py
services/crawler/app/services/pg_website_store.py
services/crawler/app/services/scheduler.py
services/crawler/app/services/search_service.py
services/crawler/app/services/website_store.py
services/crawler/app/utils/metadata.py
services/crawler/pyproject.toml
services/crawler/tests/test_chunking_service.py
services/crawler/tests/test_config.py
services/crawler/tests/test_database.py
services/crawler/tests/test_embedding_service.py
services/crawler/tests/test_index_router.py
services/crawler/tests/test_indexing_service.py
services/crawler/tests/test_pages_router.py
services/crawler/tests/test_search_router.py
services/crawler/tests/test_search_service.py
services/crawler/tests/test_website_store.py
services/crawler/tests/test_websites_router.py
services/db/Dockerfile
services/db/docker-entrypoint-wrapper.sh
services/db/init-scripts/01-init-extensions.sql
services/db/init-scripts/01-init-timescaledb.sql
services/db/init-scripts/02-create-convex-database.sql
services/db/init-scripts/03-create-rag-database.sql
services/db/init-scripts/04-create-search-database.sql
services/db/postgresql.conf
services/platform/app/features/automations/utils/step-icons.tsx
services/platform/app/features/websites/components/website-edit-dialog.tsx
services/platform/app/features/websites/components/website-pages-cell.tsx
services/platform/app/features/websites/components/website-pages-dialog.tsx
services/platform/app/features/websites/components/website-row-actions.tsx
services/platform/app/features/websites/components/websites-table.tsx
services/platform/app/features/websites/hooks/mutations.ts
services/platform/app/features/websites/hooks/queries.ts
services/platform/convex/agent_tools/database/helpers/schema_definitions.ts
services/platform/convex/agent_tools/web/helpers/query_web_context.ts
services/platform/convex/agent_tools/web/helpers/search_pages.ts
services/platform/convex/convex.config.ts
services/platform/convex/lib/embedding_config.test.ts
services/platform/convex/lib/embedding_config.ts
services/platform/convex/lib/rls/helpers/rls_rules.ts
services/platform/convex/predefined_workflows/index.ts
services/platform/convex/predefined_workflows/website_scan.ts
services/platform/convex/schema.ts
services/platform/convex/website_page_embeddings/chunk_content.test.ts
services/platform/convex/website_page_embeddings/chunk_content.ts
services/platform/convex/website_page_embeddings/content_hash.test.ts
services/platform/convex/website_page_embeddings/content_hash.ts
services/platform/convex/website_page_embeddings/embedding_pool.ts
services/platform/convex/website_page_embeddings/internal_actions.ts
services/platform/convex/website_page_embeddings/internal_mutations.ts
services/platform/convex/website_page_embeddings/internal_queries.ts
services/platform/convex/website_page_embeddings/rrf.test.ts
services/platform/convex/website_page_embeddings/rrf.ts
services/platform/convex/website_page_embeddings/schema.ts
services/platform/convex/websites/actions.ts
services/platform/convex/websites/bulk_create_websites.ts
services/platform/convex/websites/bulk_upsert_pages.ts
services/platform/convex/websites/cleanup_website.test.ts
services/platform/convex/websites/cleanup_website.ts
services/platform/convex/websites/create_website.ts
services/platform/convex/websites/delete_website.ts
services/platform/convex/websites/get_page_by_url.ts
services/platform/convex/websites/get_pages_by_website.ts
services/platform/convex/websites/get_website_by_domain.ts
services/platform/convex/websites/helpers.ts
services/platform/convex/websites/internal_actions.ts
services/platform/convex/websites/internal_mutations.ts
services/platform/convex/websites/internal_queries.ts
services/platform/convex/websites/list_website_pages_paginated.test.ts
services/platform/convex/websites/list_website_pages_paginated.ts
services/platform/convex/websites/mutations.ts
services/platform/convex/websites/provision_website_scan_workflow.ts
services/platform/convex/websites/queries.ts
services/platform/convex/websites/register_urls.ts
services/platform/convex/websites/rescan_website.ts
services/platform/convex/websites/schema.ts
services/platform/convex/websites/types.ts
services/platform/convex/websites/update_website.ts
services/platform/convex/websites/validators.ts
services/platform/convex/workflow_engine/action_defs/action_registry.ts
services/platform/convex/workflow_engine/action_defs/website/helpers/types.ts
services/platform/convex/workflow_engine/action_defs/website/website_action.ts
services/platform/convex/workflow_engine/action_defs/website_pages/helpers/types.ts
services/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.ts
services/platform/convex/workflow_engine/action_defs/workflow_processing_records/helpers/types.ts
services/platform/convex/workflow_engine/action_defs/workflow_processing_records/workflow_processing_records_action.ts
services/platform/convex/workflow_engine/helpers/validation/variables/action_schemas.ts
services/platform/convex/workflow_engine/workflow_syntax_compact.ts
services/platform/convex/workflows/processing_records/get_table_indexes.ts
services/platform/convex/workflows/processing_records/internal_mutations.ts
services/platform/convex/workflows/processing_records/types.ts
services/platform/lib/shared/schemas/websites.ts
services/platform/messages/en.json

💤 Files with no reviewable changes (42)

services/platform/convex/lib/rls/helpers/rls_rules.ts
services/platform/convex/workflows/processing_records/get_table_indexes.ts
services/platform/convex/website_page_embeddings/chunk_content.ts
services/platform/convex/websites/rescan_website.ts
services/platform/convex/websites/register_urls.ts
services/platform/convex/workflow_engine/action_defs/action_registry.ts
services/platform/convex/websites/get_page_by_url.ts
services/platform/convex/predefined_workflows/index.ts
services/platform/convex/website_page_embeddings/rrf.ts
services/platform/app/features/automations/utils/step-icons.tsx
services/platform/convex/website_page_embeddings/embedding_pool.ts
services/platform/convex/websites/get_pages_by_website.ts
services/platform/convex/websites/bulk_upsert_pages.ts
services/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.ts
services/platform/convex/website_page_embeddings/internal_actions.ts
services/platform/convex/workflow_engine/workflow_syntax_compact.ts
services/platform/convex/website_page_embeddings/rrf.test.ts
services/platform/convex/agent_tools/database/helpers/schema_definitions.ts
services/platform/convex/websites/helpers.ts
services/platform/convex/websites/cleanup_website.ts
services/platform/convex/websites/queries.ts
services/platform/convex/websites/list_website_pages_paginated.test.ts
services/platform/convex/workflows/processing_records/internal_mutations.ts
services/platform/convex/website_page_embeddings/schema.ts
services/platform/convex/website_page_embeddings/chunk_content.test.ts
services/platform/convex/website_page_embeddings/content_hash.ts
services/platform/convex/convex.config.ts
services/platform/convex/website_page_embeddings/internal_mutations.ts
services/platform/convex/website_page_embeddings/content_hash.test.ts
services/platform/convex/predefined_workflows/website_scan.ts
services/platform/convex/websites/cleanup_website.test.ts
services/platform/app/features/websites/hooks/queries.ts
services/platform/convex/lib/embedding_config.ts
services/platform/convex/workflow_engine/action_defs/website_pages/helpers/types.ts
services/platform/convex/websites/list_website_pages_paginated.ts
services/platform/convex/workflow_engine/action_defs/workflow_processing_records/workflow_processing_records_action.ts
services/platform/convex/lib/embedding_config.test.ts
services/platform/convex/website_page_embeddings/internal_queries.ts
services/crawler/tests/test_website_store.py
services/db/init-scripts/01-init-timescaledb.sql
services/crawler/app/services/website_store.py
services/platform/convex/websites/provision_website_scan_workflow.ts

coderabbitai · 2026-02-27T16:20:06Z

+    # Dependencies
+    depends_on:
+      - db
+


🧹 Nitpick | 🔵 Trivial

Consider using health-based dependency for reliable startup.

The crawler now requires the database for pg_store_manager initialization. Using depends_on: - db only ensures the container starts, not that PostgreSQL is ready to accept connections. This may cause connection errors during crawler startup.

Proposed fix to wait for healthy database

# Dependencies depends_on: - - db + db: + condition: service_healthy

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Dependencies

depends_on:

- db

# Dependencies

depends_on:

db:

condition: service_healthy

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@compose.yml` around lines 151 - 154, Replace the simple depends_on: - db with a health-based dependency so the crawler waits until PostgreSQL is ready; add a Docker healthcheck to the db service (or ensure it exists) and change the crawler's depends_on entry to use condition: service_healthy (refer to the depends_on key and the db service) so pg_store_manager initialization won't attempt connections before the DB is accepting connections.

coderabbitai · 2026-02-27T16:20:06Z

+    def get_embedding_dimensions(self) -> int:
+        """Get embedding dimensions from CRAWLER_EMBEDDING_DIMENSIONS or EMBEDDING_DIMENSIONS."""
+        dims = self.embedding_dimensions
+        if dims is None:
+            raw = os.environ.get("EMBEDDING_DIMENSIONS")
+            if raw is not None:
+                dims = int(raw)
+        if dims is None:
+            raise ValueError("EMBEDDING_DIMENSIONS must be set in environment.")
+        return dims


⚠️ Potential issue | 🟠 Major

Validate embedding dimensions as a positive integer at config boundary.

Line 101 currently accepts non-positive values (e.g., 0, -1), which are invalid for embedding/vector workflows and fail late downstream.

Suggested hardening patch

def get_embedding_dimensions(self) -> int: """Get embedding dimensions from CRAWLER_EMBEDDING_DIMENSIONS or EMBEDDING_DIMENSIONS.""" dims = self.embedding_dimensions if dims is None: raw = os.environ.get("EMBEDDING_DIMENSIONS") if raw is not None: - dims = int(raw) + try: + dims = int(raw) + except ValueError as exc: + raise ValueError("EMBEDDING_DIMENSIONS must be an integer.") from exc - if dims is None: + if dims is None: raise ValueError("EMBEDDING_DIMENSIONS must be set in environment.") + if dims <= 0: + raise ValueError("EMBEDDING_DIMENSIONS must be a positive integer.") return dims

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/config.py` around lines 95 - 104, The get_embedding_dimensions method currently permits non-positive integers; update get_embedding_dimensions (and its use of self.embedding_dimensions and the EMBEDDING_DIMENSIONS env var) to validate that the resolved value is an integer > 0: when reading from self.embedding_dimensions accept only positive ints, when parsing os.environ["EMBEDDING_DIMENSIONS"] convert with int inside a try/except and raise a clear ValueError for non-integer input, and if the value is <= 0 raise ValueError("EMBEDDING_DIMENSIONS must be a positive integer") so invalid values fail fast at config boundary.

coderabbitai · 2026-02-27T16:20:06Z

+                    LEFT JOIN (
+                        SELECT url, COUNT(*) AS chunks_count
+                        FROM chunks
+                        GROUP BY url
+                    ) c ON c.url = wu.url
+                    WHERE {where_clause}


⚠️ Potential issue | 🟠 Major

Join chunk counts on both domain and url to avoid miscounted indexing state.

At Line 57, the join keys only on url, but chunk storage is domain-scoped. This can inflate/bleed chunks_count across sites and incorrectly set indexed.

Proposed fix

- LEFT JOIN ( - SELECT url, COUNT(*) AS chunks_count - FROM chunks - GROUP BY url - ) c ON c.url = wu.url + LEFT JOIN ( + SELECT domain, url, COUNT(*) AS chunks_count + FROM chunks + GROUP BY domain, url + ) c ON c.domain = wu.domain AND c.url = wu.url

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

LEFT JOIN (

SELECT url, COUNT(*) AS chunks_count

FROM chunks

GROUP BY url

) c ON c.url = wu.url

WHERE {where_clause}

LEFT JOIN (

SELECT domain, url, COUNT(*) AS chunks_count

FROM chunks

GROUP BY domain, url

) c ON c.domain = wu.domain AND c.url = wu.url

WHERE {where_clause}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/pages.py` around lines 53 - 58, The LEFT JOIN that aggregates chunk counts currently groups and joins only by url causing counts to bleed across domains; update the chunks subquery (alias c) to SELECT domain, url, COUNT(*) AS chunks_count and GROUP BY domain, url, and change the JOIN condition to match both c.url = wu.url AND c.domain = wu.domain so chunk counts are domain-scoped when joining to the work unit rows in pages.py.

coderabbitai · 2026-02-27T16:20:06Z

+            if isinstance(sd, str):
+                sd = json.loads(sd)
+            description = extract_meta_description(sd)


⚠️ Potential issue | 🟠 Major

Don’t fail homepage ingestion on malformed structured_data JSON.

A bad JSON payload currently aborts the entire homepage update path. Parse defensively and continue with structured_data=None instead of dropping the crawl result.

Proposed fix

sd = page.get("structured_data") if isinstance(sd, str): - sd = json.loads(sd) + try: + sd = json.loads(sd) + except json.JSONDecodeError: + logger.warning(f"Invalid structured_data JSON for {domain}; ignoring") + sd = None description = extract_meta_description(sd)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if isinstance(sd, str):

sd = json.loads(sd)

description = extract_meta_description(sd)

if isinstance(sd, str):

try:

sd = json.loads(sd)

except json.JSONDecodeError:

logger.warning(f"Invalid structured_data JSON for {domain}; ignoring")

sd = None

description = extract_meta_description(sd)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/websites.py` around lines 53 - 55, The code currently does json.loads(sd) and will raise on malformed JSON, aborting homepage ingestion; wrap the string parse in a try/except (catch json.JSONDecodeError and ValueError) around the json.loads call in the same block that checks isinstance(sd, str), and on failure set sd = None (or structured_data = None if that name is used downstream) and log a debug/warn message so execution proceeds to call extract_meta_description(sd) with sd=None; reference the sd variable and the call to extract_meta_description to locate where to change behavior.

coderabbitai · 2026-02-27T16:20:06Z

+    await asyncio.gather(_crawl_homepage(), _discover_urls())
+
+    await manager.update_last_scanned(domain)
+    await manager.update_scan_status(domain, "active")
+


⚠️ Potential issue | 🟠 Major

Initialization status is forced to active even when background steps fail.

Both subtasks catch and log errors internally, but final status is always set to active. This masks failures and weakens retry/alert behavior.

Proposed fix

async def _initialize_website(domain: str, manager: PgWebsiteStoreManager): @@ + homepage_ok = False + discovery_ok = False + async def _crawl_homepage(): + nonlocal homepage_ok @@ await manager.update_website_metadata( domain=domain, title=title, description=description, page_count=1, ) + homepage_ok = True except Exception: logger.exception(f"Failed to crawl homepage for {domain}") @@ async def _discover_urls(): + nonlocal discovery_ok @@ if discovered: await site_store.save_discovered_urls(discovered) logger.info(f"Discovered {len(discovered)} URLs for {domain}") + discovery_ok = True except Exception: logger.exception(f"URL discovery failed for {domain}") @@ await manager.update_last_scanned(domain) - await manager.update_scan_status(domain, "active") + await manager.update_scan_status(domain, "active" if (homepage_ok or discovery_ok) else "error")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/websites.py` around lines 90 - 94, The code currently awaits asyncio.gather(_crawl_homepage(), _discover_urls()) then unconditionally calls manager.update_scan_status(domain, "active"); change the flow so failure in either background step results in a non-active status: have _crawl_homepage and _discover_urls either return a success boolean or re-raise on fatal error, call results = await asyncio.gather(_crawl_homepage(), _discover_urls(), return_exceptions=True), inspect results for exceptions or any False return values, and only call manager.update_scan_status(domain, "active") (and update_last_scanned(domain)) when all tasks succeeded; otherwise call manager.update_scan_status(domain, "failed") (or an appropriate error status) so failures aren’t masked.

coderabbitai · 2026-02-27T16:20:08Z

+export function ensureUrl(s: string) {
+  return s.startsWith('http://') || s.startsWith('https://')
+    ? s
+    : `https://${s}`;
 }

-function _scanIntervalToCron(interval: string): {
-  schedule: string;
-  timezone: string;
-} {
-  // Default timezone to UTC; can be made configurable later
-  const timezone = 'UTC';
-  switch (interval) {
-    case '60m':
-      return { schedule: '0 * * * *', timezone };
-    case '6h':
-      return { schedule: '0 */6 * * *', timezone };
-    case '12h':
-      return { schedule: '0 */12 * * *', timezone };
-    case '1d':
-      return { schedule: '0 2 * * *', timezone }; // Daily at 02:00 UTC
-    case '5d':
-      return { schedule: '0 2 */5 * *', timezone }; // Every 5 days at 02:00 UTC
-    case '7d':
-      return { schedule: '0 2 */7 * *', timezone }; // Every 7 days at 02:00 UTC
-    case '30d':
-      return { schedule: '0 2 1 * *', timezone }; // Monthly on the 1st at 02:00 UTC
-    default:
-      // Fallback to daily at 02:00 UTC
-      return { schedule: '0 2 * * *', timezone };
-  }
+export function toWebsiteDomain(input: string): string {
+  return new URL(ensureUrl(input)).hostname;


⚠️ Potential issue | 🟠 Major

Fix scheme detection to be case-insensitive before parsing.

Line 19 only matches lowercase schemes. Inputs like HTTPS://example.com become https://HTTPS://example.com, and Line 25 can resolve the hostname incorrectly (https), causing bad stored domains and failed dedupe.

🐛 Proposed fix

export function ensureUrl(s: string) { - return s.startsWith('http://') || s.startsWith('https://') - ? s - : `https://${s}`; + const trimmed = s.trim(); + return /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`; }

#!/bin/bash # Verifies current behavior for mixed-case schemes. node - <<'NODE' function ensureUrl(s){return s.startsWith('http://')||s.startsWith('https://')?s:`https://${s}`;} for (const input of ['https://example.com', 'HTTPS://example.com', 'HTTP://example.com']) { console.log(input, '=>', new URL(ensureUrl(input)).hostname); } NODE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/create_website.ts` around lines 18 - 25, ensureUrl currently only matches lowercase 'http://'/'https://' so inputs with mixed-case schemes (e.g. 'HTTPS://...') get double-prefixed; update ensureUrl to perform a case-insensitive check (e.g. test s against /^https?:\/\//i or compare s.toLowerCase().startsWith(...)) and return the original string when a scheme exists, otherwise prepend 'https://'. Keep toWebsiteDomain as-is (it should call the fixed ensureUrl) so new URL(ensureUrl(input)).hostname yields the correct stored domain.

coderabbitai · 2026-02-27T16:20:08Z

+    default:
+      return 21600;


⚠️ Potential issue | 🟡 Minor

Unknown scan intervals silently map to 6 hours.

This masks invalid input and can schedule unintended crawl frequency.

🎯 Proposed fix

case '30d': return 2592000; default: - return 21600; + throw new Error(`Unsupported scan interval: ${interval}`); } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

default:

return 21600;

default:

throw new Error(`Unsupported scan interval: ${interval}`);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/internal_actions.ts` around lines 49 - 50, The switch that handles scan interval values currently uses a silent default of "return 21600", which masks invalid inputs; update the default branch (the switch handling scan intervals where "default: return 21600") to validate unknown values explicitly by either throwing a descriptive Error (e.g., "Invalid scan interval: <value>") or returning a clearly handled sentinel (null/undefined) and logging the invalid input so callers must handle it; ensure the change surfaces invalid inputs instead of silently scheduling a 6-hour interval.

coderabbitai · 2026-02-27T16:20:09Z

+    for (const website of websites) {
+      const lastSync = website.metadata?.lastStatusSyncAt;
+      if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) {
+        continue;
+      }
+
+      try {
+        const websiteInfo = await fetchWebsiteInfo(website.domain);
+
+        if (websiteInfo) {
+          await ctx.runMutation(
+            internal.websites.internal_mutations.patchWebsite,
+            {
+              websiteId: website._id,
+              metadata: {
+                ...website.metadata,
+                lastStatusSyncAt: now,
+                lastSyncError: undefined,
+              },
+              status: websiteInfo.status,
+              pageCount: websiteInfo.page_count,
+              crawledPageCount: websiteInfo.crawled_count,
+              title: websiteInfo.title ?? undefined,
+              description: websiteInfo.description ?? undefined,
+              lastScannedAt: websiteInfo.last_scanned_at
+                ? new Date(websiteInfo.last_scanned_at).getTime()
+                : undefined,
+            },
+          );
+        }
+      } catch (error) {
+        const message = error instanceof Error ? error.message : String(error);
+        console.warn(`Failed to sync status for ${website.domain}: ${message}`);
+        await ctx.runMutation(
+          internal.websites.internal_mutations.patchWebsite,
+          {
+            websiteId: website._id,
+            metadata: {
+              ...website.metadata,
+              lastStatusSyncAt: now,
+              lastSyncError: message,
+            },
+          },
+        );
+      }
+    }


⚠️ Potential issue | 🟠 Major

Status sync is fully serial and can time out on larger tenants.

Each website does network I/O + mutation sequentially; this does not scale with action time budgets.

⚡ Proposed fix (bounded concurrency)

+const STATUS_SYNC_CONCURRENCY = 5; @@ - for (const website of websites) { - const lastSync = website.metadata?.lastStatusSyncAt; - if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) { - continue; - } - // existing sync logic... - } + for (let i = 0; i < websites.length; i += STATUS_SYNC_CONCURRENCY) { + const batch = websites.slice(i, i + STATUS_SYNC_CONCURRENCY); + await Promise.allSettled( + batch.map(async (website) => { + const lastSync = website.metadata?.lastStatusSyncAt; + if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) return; + // existing sync logic... + }), + ); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/internal_actions.ts` around lines 130 - 175, The loop over websites in internal_actions.ts runs fetchWebsiteInfo and ctx.runMutation serially causing timeouts; change it to process eligible websites concurrently with a bounded concurrency (worker pool) to avoid unbounded parallelism: filter websites by lastStatusSyncAt/SYNC_INTERVAL_MS using now, then schedule tasks that call fetchWebsiteInfo(website.domain) and on success or error call ctx.runMutation(internal.websites.internal_mutations.patchWebsite, ...) for that website; implement a simple concurrency limiter (e.g., queue of promises with a maxWorkers constant) to run at most N fetch+mutation tasks in parallel and await completion of all tasks before finishing the action.

coderabbitai · 2026-02-27T16:20:09Z

+    offset: v.optional(v.number()),
+    limit: v.optional(v.number()),
+  },
  handler: async (_ctx, args) => {
-    const crawlerUrl = process.env.CRAWLER_URL || 'http://localhost:8002';
-    try {
-      await fetch(`${crawlerUrl}/api/v1/websites/${args.domain}`, {
-        method: 'DELETE',
-      });
-    } catch (e) {
-      console.warn('Failed to deregister website from crawler:', e);
+    const crawlerUrl = getCrawlerUrl();
+    const offset = args.offset ?? 0;
+    const limit = args.limit ?? 100;


⚠️ Potential issue | 🟠 Major

offset/limit are not bounded before proxying to crawler APIs.

Negative or very large values can trigger expensive downstream calls and unstable behavior.

🔒 Proposed bounds

- const offset = args.offset ?? 0; - const limit = args.limit ?? 100; + const offset = Math.max(0, args.offset ?? 0); + const limit = Math.min(200, Math.max(1, args.limit ?? 100)); @@ - const limit = args.limit ?? 10; + const limit = Math.min(50, Math.max(1, args.limit ?? 10));

Also applies to: 286-291

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/internal_actions.ts` around lines 231 - 237, Validate and clamp the offset and limit before using them to call the crawler: in the handler where offset and limit are set (variables offset and limit in the async handler that calls getCrawlerUrl()), enforce offset >= 0 and limit within a reasonable max (e.g., 1..1000) or throw a validation error; apply the same fix to the other handler referenced around the second occurrence (the handler near lines 286-291). Ensure you normalize incoming args (args.offset, args.limit) and replace downstream uses so only the validated/clamped values are proxied to the crawler APIs.

coderabbitai · 2026-02-27T16:20:09Z

+    pageCount: v.optional(v.number()),
+    crawledPageCount: v.optional(v.number()),
    metadata: v.optional(jsonRecordValidator),
  },
  handler: async (ctx, args) => {
    return await WebsitesHelpers.updateWebsite(ctx, args);
  },


⚠️ Potential issue | 🟡 Minor

Validate pageCount and crawledPageCount as non-negative before update.

These new fields currently accept negative values, which can produce invalid website stats.

Proposed fix

handler: async (ctx, args) => { + if (args.pageCount !== undefined && args.pageCount < 0) { + throw new Error('pageCount must be non-negative'); + } + if (args.crawledPageCount !== undefined && args.crawledPageCount < 0) { + throw new Error('crawledPageCount must be non-negative'); + } return await WebsitesHelpers.updateWebsite(ctx, args); },

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

pageCount: v.optional(v.number()),

crawledPageCount: v.optional(v.number()),

metadata: v.optional(jsonRecordValidator),

},

handler: async (ctx, args) => {

return await WebsitesHelpers.updateWebsite(ctx, args);

},

pageCount: v.optional(v.number()),

crawledPageCount: v.optional(v.number()),

metadata: v.optional(jsonRecordValidator),

},

handler: async (ctx, args) => {

if (args.pageCount !== undefined && args.pageCount < 0) {

throw new Error('pageCount must be non-negative');

}

if (args.crawledPageCount !== undefined && args.crawledPageCount < 0) {

throw new Error('crawledPageCount must be non-negative');

}

return await WebsitesHelpers.updateWebsite(ctx, args);

},

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/internal_mutations.ts` around lines 48 - 54, The pageCount and crawledPageCount fields currently allow negatives; update their validation so they must be non-negative before calling WebsitesHelpers.updateWebsite. In the mutation schema (where pageCount and crawledPageCount are defined) change their validators to enforce >=0 (e.g., replace v.number() with a non-negative number validator or add .min(0) / integer check), or add an explicit pre-check in the handler to throw or clamp negatives before invoking WebsitesHelpers.updateWebsite; ensure the checks reference the same field names pageCount and crawledPageCount so invalid values are rejected/normalized prior to the update.

…ex reliability The embedding column starts as untyped `vector` since dimensions are configurable. This caused HNSW index creation to fail on empty tables. Now init_pool pins the column to an explicit dimension via ALTER TABLE before creating the index, and re-pins if the configured dimension changes while the table is empty. A SQL-side guard also rejects bare `vector` columns to prevent silent index failures.

- Sort imports in crawler models.py (ruff I001) - Add strict=True to zip() in scheduler.py (ruff B905) - Remove invalid padding/gap props from BorderedSection usage - Remove unused @ai-sdk/provider dependency

…rs in queries URLs and other inputs containing colons, parentheses, or slashes caused Tantivy query parser errors. Switch from raw string @@@ to structured paradedb.match() which bypasses the string parser entirely.

larryro added 5 commits February 28, 2026 00:14

refactor(db): rename search database from tale_crawler_search to tale…

53abe48

…_search

larryro force-pushed the feat/migrate-search-to-crawler branch from a319433 to ebe30b2 Compare February 27, 2026 16:14

coderabbitai Bot requested changes Feb 27, 2026

View reviewed changes

larryro added 4 commits February 28, 2026 00:47

fix: resolve CI lint, typecheck, and knip failures

58b79f1

- Sort imports in crawler models.py (ruff I001) - Add strict=True to zip() in scheduler.py (ruff B905) - Remove invalid padding/gap props from BorderedSection usage - Remove unused @ai-sdk/provider dependency

fix: format services/platform/package.json

b628a56

larryro merged commit 3af4492 into main Feb 27, 2026
17 checks passed

larryro deleted the feat/migrate-search-to-crawler branch February 27, 2026 17:10

This was referenced Feb 28, 2026

feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering #604

Merged

Replace Cognee with lightweight RAG pipeline #607

Merged

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat: migrate website search and embeddings to crawler service (#595)

4a9244e

coderabbitai Bot mentioned this pull request Apr 10, 2026

feat(platform): budget limits per user/group #1311

Merged

10 tasks

coderabbitai Bot mentioned this pull request Apr 24, 2026

refactor(db): split service migrations and add chunk overlap fields #1615

Merged

6 tasks

Conversation

larryro commented Feb 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Summary by CodeRabbit

Uh oh!

greptile-apps Bot commented Feb 27, 2026

Uh oh!

coderabbitai Bot commented Feb 27, 2026

Code Review Summary

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Feb 27, 2026 •

edited by coderabbitai Bot

Loading