feat: migrate website search and embeddings to crawler service#595
Conversation
|
Too many files changed for review. ( |
…rvice Move search indexing and embedding storage from Convex to the crawler service backed by PostgreSQL + pgvector. Make all DB init scripts idempotent and run them on every container startup. Drop legacy TimescaleDB extension, ensure crawler upserts website_urls before inserting chunks, and add transactional cascading deletes when removing a website.
Add semantic search and chunk viewer to the website pages dialog. Make website registration non-blocking with background homepage crawl and scheduled metadata sync. Track crawled vs total page counts separately. Remove the standalone rescan action in favor of scheduler-driven scans. Improve reliability with exponential backoff retries, HNSW index auto-creation, scan cancellation, domain normalization, parameterized sort columns, and proper URL encoding.
…kdownSplitter Swap the hand-rolled fixed-size chunking logic for the semantic-text-splitter library's MarkdownSplitter, which splits at structural markdown boundaries (headers, code blocks, tables, paragraphs, sentences) while respecting the target chunk size. This removes ~120 lines of manual paragraph/sentence/hard-split code and the abbreviation-aware sentence splitter. Tests updated to validate markdown-aware behaviour (header boundaries, code blocks, tables) and relaxed implementation-detail assertions.
Make embedding dimensions required via env var with startup validation to prevent silent dimension mismatches. Improve error visibility across crawler (background task logging, HNSW warnings) and platform (error toasts, sync error tracking in metadata). Fix website delete ordering to deregister from crawler before removing DB record.
a319433 to
ebe30b2
Compare
📝 WalkthroughCode Review SummaryWalkthroughThis pull request performs a comprehensive architectural migration of the website crawler and search infrastructure. The crawler service gains PostgreSQL-backed persistence (replacing SQLite), adds new services for content embedding, indexing, and hybrid search, and exposes additional REST API endpoints for indexing and search operations. The platform backend removes its internal website page embeddings model and websitePages table, replacing them with external API calls to the crawler service for search, page listing, and synchronization operations. The database configuration shifts from TimescaleDB to ParadeDB for enhanced full-text and vector search capabilities. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes The diff introduces significant structural changes across multiple service boundaries: new asynchronous database integration and service orchestration in the crawler, replacement of a large embedded embedding/search subsystem in the platform, and refactoring of website synchronization and persistence patterns. The changes span heterogeneous components (database schema, multiple new services with interdependencies, API endpoint wiring, and removal of legacy functionality), necessitating careful verification of service initialization order, database consistency, external API contract alignment, and data flow correctness. Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 37
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/platform/convex/websites/bulk_create_websites.ts (1)
8-33: 🧹 Nitpick | 🔵 TrivialReuse
toWebsiteDomainto keep normalization logic single-sourced.Bulk create still duplicates hostname extraction in Line 32. Calling the shared domain helper keeps behavior consistent across create paths and avoids future drift.
♻️ Proposed refactor
-import { ensureUrl } from './create_website'; +import { toWebsiteDomain } from './create_website'; @@ - const normalized = new URL(ensureUrl(websiteData.domain)).hostname; + const normalized = toWebsiteDomain(websiteData.domain);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/websites/bulk_create_websites.ts` around lines 8 - 33, The loop in bulkCreateWebsites duplicates hostname normalization by calling new URL(ensureUrl(...)).hostname; replace that with the shared helper toWebsiteDomain to single-source normalization and avoid drift: inside bulkCreateWebsites (function bulkCreateWebsites) call toWebsiteDomain(websiteData.domain) instead of new URL(ensureUrl(...)).hostname and remove the redundant ensureUrl usage so all create paths use the same normalization logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compose.yml`:
- Around line 151-154: Replace the simple depends_on: - db with a health-based
dependency so the crawler waits until PostgreSQL is ready; add a Docker
healthcheck to the db service (or ensure it exists) and change the crawler's
depends_on entry to use condition: service_healthy (refer to the depends_on key
and the db service) so pg_store_manager initialization won't attempt connections
before the DB is accepting connections.
In `@services/crawler/app/config.py`:
- Around line 95-104: The get_embedding_dimensions method currently permits
non-positive integers; update get_embedding_dimensions (and its use of
self.embedding_dimensions and the EMBEDDING_DIMENSIONS env var) to validate that
the resolved value is an integer > 0: when reading from
self.embedding_dimensions accept only positive ints, when parsing
os.environ["EMBEDDING_DIMENSIONS"] convert with int inside a try/except and
raise a clear ValueError for non-integer input, and if the value is <= 0 raise
ValueError("EMBEDDING_DIMENSIONS must be a positive integer") so invalid values
fail fast at config boundary.
In `@services/crawler/app/routers/pages.py`:
- Around line 53-58: The LEFT JOIN that aggregates chunk counts currently groups
and joins only by url causing counts to bleed across domains; update the chunks
subquery (alias c) to SELECT domain, url, COUNT(*) AS chunks_count and GROUP BY
domain, url, and change the JOIN condition to match both c.url = wu.url AND
c.domain = wu.domain so chunk counts are domain-scoped when joining to the work
unit rows in pages.py.
In `@services/crawler/app/routers/websites.py`:
- Around line 53-55: The code currently does json.loads(sd) and will raise on
malformed JSON, aborting homepage ingestion; wrap the string parse in a
try/except (catch json.JSONDecodeError and ValueError) around the json.loads
call in the same block that checks isinstance(sd, str), and on failure set sd =
None (or structured_data = None if that name is used downstream) and log a
debug/warn message so execution proceeds to call extract_meta_description(sd)
with sd=None; reference the sd variable and the call to extract_meta_description
to locate where to change behavior.
- Around line 90-94: The code currently awaits asyncio.gather(_crawl_homepage(),
_discover_urls()) then unconditionally calls manager.update_scan_status(domain,
"active"); change the flow so failure in either background step results in a
non-active status: have _crawl_homepage and _discover_urls either return a
success boolean or re-raise on fatal error, call results = await
asyncio.gather(_crawl_homepage(), _discover_urls(), return_exceptions=True),
inspect results for exceptions or any False return values, and only call
manager.update_scan_status(domain, "active") (and update_last_scanned(domain))
when all tasks succeeded; otherwise call manager.update_scan_status(domain,
"failed") (or an appropriate error status) so failures aren’t masked.
In `@services/crawler/app/services/database.py`:
- Around line 22-23: Replace the implicit default weak DB password fallback with
a fail-fast check: if DATABASE_URL is provided return it directly; otherwise
require DB_PASSWORD to be set and raise a clear exception if it is missing
instead of defaulting to "tale_password_change_me". Update the code that reads
os.environ.get("DB_PASSWORD", ...) and the connection-string construction (the
variable password and the return
f"postgresql://tale:{password}@db:5432/tale_search") to validate presence and
raise an error with a helpful message so misconfiguration is detected at
startup.
- Around line 31-49: The pool created by asyncpg.create_pool is assigned to the
global _pool before post-creation checks, so any exception during initialization
(e.g., in settings.get_embedding_dimensions or the SELECT) can leave _pool open;
wrap the post-creation initialization and the embedding-dimension validation in
a try/except/finally (or try/except) so that on any exception you await
_pool.close() and set _pool = None before re-raising; apply this around the
block that uses _pool, including the stored_dims fetch and the configured_dims
comparison, referencing _get_database_url, asyncpg.create_pool, _pool,
settings.get_embedding_dimensions, and the stored_dims validation.
In `@services/crawler/app/services/embedding_service.py`:
- Around line 35-40: The code returns embeddings assuming response.data length
matches the input batch; add a cardinality guard after calling
self._client.embeddings.create to verify len(response.data) == len(batch) (where
response.data and batch are the unique symbols here) and if they differ raise a
clear exception (e.g., ValueError) or log and raise so downstream
chunk-to-embedding alignment cannot silently drift; keep the original return of
[item.embedding for item in response.data] only after the check passes.
In `@services/crawler/app/services/indexing_service.py`:
- Around line 66-72: The code builds the INSERT payload by indexing into
embeddings[i] without verifying embeddings length; before creating the list
comprehension used in the conn.executemany call, check that len(embeddings) ==
len(chunks) (or handle a mismatch explicitly), and if they differ, log or raise
a clear error and abort the insert to avoid an unhandled IndexError; update the
block around embeddings and chunks (the variables used in the INSERT preparing
logic) to validate cardinality and fail fast with a descriptive message or
handle the discrepancy according to the intended behavior (e.g., skip, pad, or
retry).
In `@services/crawler/app/services/pg_website_store.py`:
- Around line 134-135: The current conditional json.dumps(u["metadata"]) if
u.get("metadata") else None (and similarly for structured_data) treats empty
objects/arrays as falsy and converts them to NULL; change the condition to
preserve empty JSON by checking existence/non-null instead (e.g., use
json.dumps(u["metadata"]) if "metadata" in u or if u.get("metadata") is not
None, and the same for u["structured_data"]) so empty {} or [] are serialized
rather than dropped.
In `@services/crawler/app/services/scheduler.py`:
- Around line 214-226: The cancellation flag is being cleared too early by
_clear_cancelled(domain), so a cancellation requested just before scan start is
ignored; change the logic to check _is_cancelled(domain) before calling
_clear_cancelled(domain) (i.e., remove or move the early _clear_cancelled call
and only clear the queued cancellation after you've detected and handled the
cancellation or at the very end of the scan). Update the three occurrences
involving _clear_cancelled/_is_cancelled in scheduler.py (the initial block
around crawler_service.initialize and the later checkpoints referenced in the
review) so each checkpoint first calls _is_cancelled(domain) and handles/returns
accordingly, and only calls _clear_cancelled(domain) when acknowledging and
finishing cancellation or after completion.
- Around line 136-140: The HEAD-check AsyncClient is disabling TLS verification
(verify=False) which is unsafe; update the AsyncClient creation used for HEAD
checks (the async with httpx.AsyncClient(...) as client blocks referencing
_HEAD_TIMEOUT and client) to remove verify=False or set verify=True so TLS certs
are validated, and apply the same change to the other AsyncClient usage
mentioned (the second block around the other HEAD check). Ensure these changes
only affect the HEAD-check client creation in scheduler.py.
- Around line 294-296: The code currently calls json.loads(sd) unguarded which
can raise on malformed structured data and abort the scan; wrap the conversion
in a try/except catching json.JSONDecodeError (and a generic Exception
fallback), log or warn about the malformed payload, and set sd to a safe default
(e.g., {} or None) before calling extract_meta_description so the scan
continues; make this change around the existing isinstance(sd, str) branch that
handles sd and the subsequent call to extract_meta_description to ensure
malformed JSON doesn’t crash the scheduler.
In `@services/crawler/app/services/search_service.py`:
- Around line 76-102: The code serializes embeddings with json.dumps and relies
on casting in SQL; instead register asyncpg's pgvector/vector codec when
creating the connection pool (so pgvector values are sent/received natively) and
then modify _vector_search to pass the embedding list directly (remove vec_str =
json.dumps(embedding) and use embedding as the parameter) while keeping the same
SQL (with $1::vector cast if desired) so asyncpg will encode the Python list
into the pgvector type automatically; update pool creation/registration logic
(where self._pool is created) to call the asyncpg pgvector adapter/codec
registration so the driver knows how to handle vector values.
In `@services/crawler/pyproject.toml`:
- Around line 21-23: The dependency entries "asyncpg", "tiktoken", and
"semantic-text-splitter" use minimum version specifiers (>=) which creates
non-reproducible builds; update these entries in pyproject.toml to use pinned,
exact versions (==) consistent with the rest of the file—e.g., change
"asyncpg>=0.30.0", "tiktoken>=0.9.0", and "semantic-text-splitter>=0.18.0" to
explicit pinned versions (choose the tested/latest known-good patch versions) so
the project consistently uses exact pins across dependencies.
In `@services/crawler/tests/test_chunking_service.py`:
- Around line 141-150: The current assertion in
test_overlap_produces_shared_content is ineffective; change it to verify actual
shared content between adjacent chunks by checking for overlapping sentence
tokens rather than string concatenation length. Update the test to iterate
adjacent pairs from chunk_content(...) and assert that there exists at least one
sentence identifier (e.g., the substring "Sentence number X" or split by
sentences) that appears in both result[i].content and result[i+1].content;
reference the test function test_overlap_produces_shared_content and the
chunk_content call to locate the change.
In `@services/crawler/tests/test_config.py`:
- Around line 88-109: Add tests in TestGetEmbeddingDimensions that exercise
get_embedding_dimensions() with invalid env values: create one test that sets
EMBEDDING_DIMENSIONS (and another that sets CRAWLER_EMBEDDING_DIMENSIONS) to a
non-integer string like "abc" and asserts it raises ValueError (match
"EMBEDDING_DIMENSIONS"), and another test that sets each to non-positive values
("0" and "-1") and asserts ValueError; use the same patch.dict(os.environ, env,
clear=True) pattern and _base_env() setup so both prefixed
(CRAWLER_EMBEDDING_DIMENSIONS) and generic (EMBEDDING_DIMENSIONS) cases are
covered.
In `@services/crawler/tests/test_pages_router.py`:
- Around line 13-14: Replace uses of timezone.utc with the datetime.UTC alias
for the constants _DEFAULT_CRAWLED and _DEFAULT_DISCOVERED: change
tzinfo=timezone.utc to tzinfo=datetime.UTC in the datetime(...) calls, and
update imports so datetime.UTC is available (e.g., import datetime as datetime
or adjust from datetime import datetime and reference datetime.UTC accordingly).
In `@services/crawler/tests/test_search_service.py`:
- Around line 172-176: The test asserts a last-write-wins merge strategy but the
production code lacks documentation; add a concise comment or docstring inside
SearchService._merge_rrf stating that when multiple result lists contain the
same item id, later lists overwrite earlier item metadata (i.e.,
last-write-wins), so future maintainers understand this intentional behavior and
the contract relied upon by tests like test_later_list_overwrites_item_metadata.
In `@services/db/docker-entrypoint-wrapper.sh`:
- Around line 91-97: The schema/init step is run in the background causing a
readiness race; change the wrapper so that after pg_isready confirms Postgres is
accepting connections it calls run_init_scripts synchronously (not in a
background subshell) and only returns/exits once run_init_scripts completes,
ensuring schema/index creation finishes before the container signals readiness;
locate the block using pg_isready and run_init_scripts in
docker-entrypoint-wrapper.sh and remove the trailing '&' and surrounding
background subshell so the init runs in the foreground.
- Around line 86-87: The init script is swallowing failures because the pipeline
`psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f "$script" 2>&1 | grep -E
"^(ERROR|NOTICE)" || true` forces a zero exit even when migrations fail; change
the logic so that `psql` errors cause a non-zero exit: run `psql -f "$script"`
capturing stdout/stderr, detect lines that start with "ERROR" (or simply rely on
psql's exit code) for the given "$script" and if any error is found exit with a
non-zero status and log the details (don’t use `|| true`), ensuring failures
during processing of "$script" bubble up and stop container startup.
In `@services/db/init-scripts/01-init-extensions.sql`:
- Around line 21-24: The current GRANT statements only affect existing objects
in schema "tale" (GRANT ALL PRIVILEGES ON SCHEMA tale; ON ALL TABLES IN SCHEMA
tale; ON ALL SEQUENCES IN SCHEMA tale TO CURRENT_USER), so add corresponding
default-privilege statements to ensure tables and sequences created later
inherit the same rights: run ALTER DEFAULT PRIVILEGES IN SCHEMA tale ... GRANT
ALL PRIVILEGES ON TABLES TO CURRENT_USER and ALTER DEFAULT PRIVILEGES IN SCHEMA
tale ... GRANT ALL PRIVILEGES ON SEQUENCES TO CURRENT_USER (use the same schema
"tale" and principal CURRENT_USER to match the existing grants).
In `@services/db/init-scripts/02-create-convex-database.sql`:
- Line 12: The script currently uses "DROP EXTENSION IF EXISTS timescaledb
CASCADE;" which can unintentionally drop dependent objects and data; update the
statement that drops the TimescaleDB extension (the DROP EXTENSION for
timescaledb) to remove the CASCADE keyword (i.e., use a non-cascading drop or
skip dropping entirely), so the startup script either uses "DROP EXTENSION IF
EXISTS timescaledb" without CASCADE or avoids dropping the extension at runtime
to preserve dependent objects and data.
In `@services/db/init-scripts/03-create-rag-database.sql`:
- Line 12: The startup script currently executes a destructive DROP EXTENSION IF
EXISTS timescaledb CASCADE which can remove dependent objects; remove that DROP
statement (the line containing "DROP EXTENSION IF EXISTS timescaledb CASCADE;")
from the init path and replace it with a non-destructive approach: ensure the
extension is created with a safe idempotent call (e.g., use a
create-if-not-exists flow) or move any DROP logic into an explicit migration
script that includes pre-checks, backups and operator approval; if you need to
perform version changes, implement them as separate migrations that check for
dependent objects and require manual approval rather than dropping the extension
during startup.
In `@services/platform/app/features/websites/components/websites-table.tsx`:
- Around line 37-44: The background sync in the useEffect block (key
`websites-sync-${organizationId}` using sessionStorage) calls syncStatuses({
organizationId }) fire-and-forget; add minimal error handling by awaiting or
attaching a .catch to syncStatuses and logging failures (via existing logger or
console.error) and optionally schedule a single silent retry/backoff (e.g.,
setTimeout) to avoid leaving stale status data; keep the 5-minute cooldown logic
intact and ensure errors do not throw to UI by swallowing them after logging.
In `@services/platform/convex/agent_tools/web/helpers/query_web_context.ts`:
- Around line 58-65: Wrap the single fetch call to the crawler (the block using
getCrawlerServiceUrl(), controller.signal, query and limit) in a short inline
retry loop: attempt the POST up to a few times (e.g., 3 attempts) with a short
awaitable delay (~1s) between attempts, catching transient errors (network
failures, timeouts, and HTTP 5xx/429 responses) and only rethrowing after
exhausting retries; preserve and pass controller.signal to each fetch attempt,
and do not use ctx.scheduler.runAfter for retries. Ensure the retry logic is
colocated around the existing fetch invocation so synchronous agent-tool
execution still returns a result or a final error.
In `@services/platform/convex/agent_tools/web/helpers/search_pages.ts`:
- Around line 38-46: The outbound fetch to `${crawlerUrl}/api/v1/search` (in
services/platform/convex/agent_tools/web/helpers/search_pages.ts) needs a
timeout and short inline retries: wrap the existing fetch call in an
AbortController with a reasonable timeout (e.g., 2–5s) and implement a small
retry loop (e.g., 2–3 attempts) that retries on network errors, timeouts, or 5xx
responses (do not retry on 4xx), with a short exponential/backoff delay between
attempts; include attempt count and last error/status in the final thrown Error
if all retries fail, and keep using the same payload (args.query, DEFAULT_LIMIT)
when re-sending.
- Around line 35-42: The code calls the crawler even when args.query is
empty/whitespace; before calling getCrawlerServiceUrl()/fetch, validate
args.query by trimming and checking length, and fail fast with a clear
user-facing error/response (e.g., return an error object or throw an Error with
a guidance message like "Please provide a non-empty search query") from the
search_pages helper so no network I/O occurs; place this check immediately after
the debugLog and before constructing crawlerUrl/performing fetch, referencing
args.query, debugLog, getCrawlerServiceUrl, and DEFAULT_LIMIT to locate the spot
to change.
- Around line 38-42: The POST to `${crawlerUrl}/api/v1/search` currently sends
only query and limit (using args.query and DEFAULT_LIMIT) which can leak results
across tenants; update the request payload in the fetch call so it includes the
tenant/site/domain scope (e.g., add tenantId or domain/siteId fields sourced
from the current request context or args) so the crawler can filter by
tenant—modify the JSON.stringify body passed to fetch in this helper (the call
that uses crawlerUrl, args.query, DEFAULT_LIMIT) to include the appropriate
tenant/domain property used by the crawler API.
In `@services/platform/convex/websites/actions.ts`:
- Around line 132-143: The code registers the crawler against website.domain
even when args.domain has changed, leaving the crawler tied to the old domain;
update the logic in the block around registerDomainWithCrawler and the
subsequent patchWebsite so that when args.domain differs from website.domain you
(1) register the new domain with the crawler using
registerDomainWithCrawler(args.domain, args.scanInterval) (or if scanInterval
unchanged, still use args.scanInterval), and (2) deregister the old domain by
calling the corresponding unregister function (e.g.,
unregisterDomainWithCrawler(website.domain)) before or after registration;
ensure you still handle scanInterval changes by using args.scanInterval when
calling registerDomainWithCrawler and keep the
ctx.runMutation(internal.websites.internal_mutations.patchWebsite, {...}) call
to persist the new domain and scanInterval.
- Around line 44-53: The code computes toWebsiteDomain but still passes the raw
args.domain into the provisioning mutation, causing inconsistent persisted
domains; update the payload sent to
internal.websites.internal_mutations.provisionWebsite (the call that assigns
websiteId) to use the normalized toWebsiteDomain value for the domain field (and
any other places in this provisioning block that rely on the domain) so the
stored domain matches the normalized value used for crawler registration.
- Around line 205-210: After retrieving the website with
ctx.runQuery(internal.websites.internal_queries.getWebsite, { websiteId:
args.websiteId }) in the fetchPages, fetchChunks, and searchContent actions, add
a defensive authorization check that verifies the caller (ctx.auth?.userId) is a
member of the website's organization (website.organizationId) before returning
any data; perform a membership lookup via your existing org/membership query
(e.g., call an appropriate internal.organizations/internal_memberships query)
and throw an Unauthorized error if no membership is found so only org members
can access that website's pages/chunks/search.
In `@services/platform/convex/websites/create_website.ts`:
- Around line 18-25: ensureUrl currently only matches lowercase
'http://'/'https://' so inputs with mixed-case schemes (e.g. 'HTTPS://...') get
double-prefixed; update ensureUrl to perform a case-insensitive check (e.g. test
s against /^https?:\/\//i or compare s.toLowerCase().startsWith(...)) and return
the original string when a scheme exists, otherwise prepend 'https://'. Keep
toWebsiteDomain as-is (it should call the fixed ensureUrl) so new
URL(ensureUrl(input)).hostname yields the correct stored domain.
In `@services/platform/convex/websites/internal_actions.ts`:
- Around line 231-237: Validate and clamp the offset and limit before using them
to call the crawler: in the handler where offset and limit are set (variables
offset and limit in the async handler that calls getCrawlerUrl()), enforce
offset >= 0 and limit within a reasonable max (e.g., 1..1000) or throw a
validation error; apply the same fix to the other handler referenced around the
second occurrence (the handler near lines 286-291). Ensure you normalize
incoming args (args.offset, args.limit) and replace downstream uses so only the
validated/clamped values are proxied to the crawler APIs.
- Around line 49-50: The switch that handles scan interval values currently uses
a silent default of "return 21600", which masks invalid inputs; update the
default branch (the switch handling scan intervals where "default: return
21600") to validate unknown values explicitly by either throwing a descriptive
Error (e.g., "Invalid scan interval: <value>") or returning a clearly handled
sentinel (null/undefined) and logging the invalid input so callers must handle
it; ensure the change surfaces invalid inputs instead of silently scheduling a
6-hour interval.
- Around line 130-175: The loop over websites in internal_actions.ts runs
fetchWebsiteInfo and ctx.runMutation serially causing timeouts; change it to
process eligible websites concurrently with a bounded concurrency (worker pool)
to avoid unbounded parallelism: filter websites by
lastStatusSyncAt/SYNC_INTERVAL_MS using now, then schedule tasks that call
fetchWebsiteInfo(website.domain) and on success or error call
ctx.runMutation(internal.websites.internal_mutations.patchWebsite, ...) for that
website; implement a simple concurrency limiter (e.g., queue of promises with a
maxWorkers constant) to run at most N fetch+mutation tasks in parallel and await
completion of all tasks before finishing the action.
In `@services/platform/convex/websites/internal_mutations.ts`:
- Around line 48-54: The pageCount and crawledPageCount fields currently allow
negatives; update their validation so they must be non-negative before calling
WebsitesHelpers.updateWebsite. In the mutation schema (where pageCount and
crawledPageCount are defined) change their validators to enforce >=0 (e.g.,
replace v.number() with a non-negative number validator or add .min(0) / integer
check), or add an explicit pre-check in the handler to throw or clamp negatives
before invoking WebsitesHelpers.updateWebsite; ensure the checks reference the
same field names pageCount and crawledPageCount so invalid values are
rejected/normalized prior to the update.
---
Outside diff comments:
In `@services/platform/convex/websites/bulk_create_websites.ts`:
- Around line 8-33: The loop in bulkCreateWebsites duplicates hostname
normalization by calling new URL(ensureUrl(...)).hostname; replace that with the
shared helper toWebsiteDomain to single-source normalization and avoid drift:
inside bulkCreateWebsites (function bulkCreateWebsites) call
toWebsiteDomain(websiteData.domain) instead of new URL(ensureUrl(...)).hostname
and remove the redundant ensureUrl usage so all create paths use the same
normalization logic.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (2)
services/crawler/uv.lockis excluded by!**/*.lockservices/platform/convex/_generated/api.d.tsis excluded by!**/_generated/**
📒 Files selected for processing (107)
compose.ymlservices/crawler/app/config.pyservices/crawler/app/main.pyservices/crawler/app/models.pyservices/crawler/app/routers/__init__.pyservices/crawler/app/routers/crawler.pyservices/crawler/app/routers/index.pyservices/crawler/app/routers/pages.pyservices/crawler/app/routers/search.pyservices/crawler/app/routers/websites.pyservices/crawler/app/services/chunking_service.pyservices/crawler/app/services/database.pyservices/crawler/app/services/embedding_service.pyservices/crawler/app/services/indexing_service.pyservices/crawler/app/services/pg_website_store.pyservices/crawler/app/services/scheduler.pyservices/crawler/app/services/search_service.pyservices/crawler/app/services/website_store.pyservices/crawler/app/utils/metadata.pyservices/crawler/pyproject.tomlservices/crawler/tests/test_chunking_service.pyservices/crawler/tests/test_config.pyservices/crawler/tests/test_database.pyservices/crawler/tests/test_embedding_service.pyservices/crawler/tests/test_index_router.pyservices/crawler/tests/test_indexing_service.pyservices/crawler/tests/test_pages_router.pyservices/crawler/tests/test_search_router.pyservices/crawler/tests/test_search_service.pyservices/crawler/tests/test_website_store.pyservices/crawler/tests/test_websites_router.pyservices/db/Dockerfileservices/db/docker-entrypoint-wrapper.shservices/db/init-scripts/01-init-extensions.sqlservices/db/init-scripts/01-init-timescaledb.sqlservices/db/init-scripts/02-create-convex-database.sqlservices/db/init-scripts/03-create-rag-database.sqlservices/db/init-scripts/04-create-search-database.sqlservices/db/postgresql.confservices/platform/app/features/automations/utils/step-icons.tsxservices/platform/app/features/websites/components/website-edit-dialog.tsxservices/platform/app/features/websites/components/website-pages-cell.tsxservices/platform/app/features/websites/components/website-pages-dialog.tsxservices/platform/app/features/websites/components/website-row-actions.tsxservices/platform/app/features/websites/components/websites-table.tsxservices/platform/app/features/websites/hooks/mutations.tsservices/platform/app/features/websites/hooks/queries.tsservices/platform/convex/agent_tools/database/helpers/schema_definitions.tsservices/platform/convex/agent_tools/web/helpers/query_web_context.tsservices/platform/convex/agent_tools/web/helpers/search_pages.tsservices/platform/convex/convex.config.tsservices/platform/convex/lib/embedding_config.test.tsservices/platform/convex/lib/embedding_config.tsservices/platform/convex/lib/rls/helpers/rls_rules.tsservices/platform/convex/predefined_workflows/index.tsservices/platform/convex/predefined_workflows/website_scan.tsservices/platform/convex/schema.tsservices/platform/convex/website_page_embeddings/chunk_content.test.tsservices/platform/convex/website_page_embeddings/chunk_content.tsservices/platform/convex/website_page_embeddings/content_hash.test.tsservices/platform/convex/website_page_embeddings/content_hash.tsservices/platform/convex/website_page_embeddings/embedding_pool.tsservices/platform/convex/website_page_embeddings/internal_actions.tsservices/platform/convex/website_page_embeddings/internal_mutations.tsservices/platform/convex/website_page_embeddings/internal_queries.tsservices/platform/convex/website_page_embeddings/rrf.test.tsservices/platform/convex/website_page_embeddings/rrf.tsservices/platform/convex/website_page_embeddings/schema.tsservices/platform/convex/websites/actions.tsservices/platform/convex/websites/bulk_create_websites.tsservices/platform/convex/websites/bulk_upsert_pages.tsservices/platform/convex/websites/cleanup_website.test.tsservices/platform/convex/websites/cleanup_website.tsservices/platform/convex/websites/create_website.tsservices/platform/convex/websites/delete_website.tsservices/platform/convex/websites/get_page_by_url.tsservices/platform/convex/websites/get_pages_by_website.tsservices/platform/convex/websites/get_website_by_domain.tsservices/platform/convex/websites/helpers.tsservices/platform/convex/websites/internal_actions.tsservices/platform/convex/websites/internal_mutations.tsservices/platform/convex/websites/internal_queries.tsservices/platform/convex/websites/list_website_pages_paginated.test.tsservices/platform/convex/websites/list_website_pages_paginated.tsservices/platform/convex/websites/mutations.tsservices/platform/convex/websites/provision_website_scan_workflow.tsservices/platform/convex/websites/queries.tsservices/platform/convex/websites/register_urls.tsservices/platform/convex/websites/rescan_website.tsservices/platform/convex/websites/schema.tsservices/platform/convex/websites/types.tsservices/platform/convex/websites/update_website.tsservices/platform/convex/websites/validators.tsservices/platform/convex/workflow_engine/action_defs/action_registry.tsservices/platform/convex/workflow_engine/action_defs/website/helpers/types.tsservices/platform/convex/workflow_engine/action_defs/website/website_action.tsservices/platform/convex/workflow_engine/action_defs/website_pages/helpers/types.tsservices/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.tsservices/platform/convex/workflow_engine/action_defs/workflow_processing_records/helpers/types.tsservices/platform/convex/workflow_engine/action_defs/workflow_processing_records/workflow_processing_records_action.tsservices/platform/convex/workflow_engine/helpers/validation/variables/action_schemas.tsservices/platform/convex/workflow_engine/workflow_syntax_compact.tsservices/platform/convex/workflows/processing_records/get_table_indexes.tsservices/platform/convex/workflows/processing_records/internal_mutations.tsservices/platform/convex/workflows/processing_records/types.tsservices/platform/lib/shared/schemas/websites.tsservices/platform/messages/en.json
💤 Files with no reviewable changes (42)
- services/platform/convex/lib/rls/helpers/rls_rules.ts
- services/platform/convex/workflows/processing_records/get_table_indexes.ts
- services/platform/convex/website_page_embeddings/chunk_content.ts
- services/platform/convex/websites/rescan_website.ts
- services/platform/convex/websites/register_urls.ts
- services/platform/convex/workflow_engine/action_defs/action_registry.ts
- services/platform/convex/websites/get_page_by_url.ts
- services/platform/convex/predefined_workflows/index.ts
- services/platform/convex/website_page_embeddings/rrf.ts
- services/platform/app/features/automations/utils/step-icons.tsx
- services/platform/convex/website_page_embeddings/embedding_pool.ts
- services/platform/convex/websites/get_pages_by_website.ts
- services/platform/convex/websites/bulk_upsert_pages.ts
- services/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.ts
- services/platform/convex/website_page_embeddings/internal_actions.ts
- services/platform/convex/workflow_engine/workflow_syntax_compact.ts
- services/platform/convex/website_page_embeddings/rrf.test.ts
- services/platform/convex/agent_tools/database/helpers/schema_definitions.ts
- services/platform/convex/websites/helpers.ts
- services/platform/convex/websites/cleanup_website.ts
- services/platform/convex/websites/queries.ts
- services/platform/convex/websites/list_website_pages_paginated.test.ts
- services/platform/convex/workflows/processing_records/internal_mutations.ts
- services/platform/convex/website_page_embeddings/schema.ts
- services/platform/convex/website_page_embeddings/chunk_content.test.ts
- services/platform/convex/website_page_embeddings/content_hash.ts
- services/platform/convex/convex.config.ts
- services/platform/convex/website_page_embeddings/internal_mutations.ts
- services/platform/convex/website_page_embeddings/content_hash.test.ts
- services/platform/convex/predefined_workflows/website_scan.ts
- services/platform/convex/websites/cleanup_website.test.ts
- services/platform/app/features/websites/hooks/queries.ts
- services/platform/convex/lib/embedding_config.ts
- services/platform/convex/workflow_engine/action_defs/website_pages/helpers/types.ts
- services/platform/convex/websites/list_website_pages_paginated.ts
- services/platform/convex/workflow_engine/action_defs/workflow_processing_records/workflow_processing_records_action.ts
- services/platform/convex/lib/embedding_config.test.ts
- services/platform/convex/website_page_embeddings/internal_queries.ts
- services/crawler/tests/test_website_store.py
- services/db/init-scripts/01-init-timescaledb.sql
- services/crawler/app/services/website_store.py
- services/platform/convex/websites/provision_website_scan_workflow.ts
| # Dependencies | ||
| depends_on: | ||
| - db | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider using health-based dependency for reliable startup.
The crawler now requires the database for pg_store_manager initialization. Using depends_on: - db only ensures the container starts, not that PostgreSQL is ready to accept connections. This may cause connection errors during crawler startup.
Proposed fix to wait for healthy database
# Dependencies
depends_on:
- - db
+ db:
+ condition: service_healthy📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Dependencies | |
| depends_on: | |
| - db | |
| # Dependencies | |
| depends_on: | |
| db: | |
| condition: service_healthy | |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compose.yml` around lines 151 - 154, Replace the simple depends_on: - db with
a health-based dependency so the crawler waits until PostgreSQL is ready; add a
Docker healthcheck to the db service (or ensure it exists) and change the
crawler's depends_on entry to use condition: service_healthy (refer to the
depends_on key and the db service) so pg_store_manager initialization won't
attempt connections before the DB is accepting connections.
| def get_embedding_dimensions(self) -> int: | ||
| """Get embedding dimensions from CRAWLER_EMBEDDING_DIMENSIONS or EMBEDDING_DIMENSIONS.""" | ||
| dims = self.embedding_dimensions | ||
| if dims is None: | ||
| raw = os.environ.get("EMBEDDING_DIMENSIONS") | ||
| if raw is not None: | ||
| dims = int(raw) | ||
| if dims is None: | ||
| raise ValueError("EMBEDDING_DIMENSIONS must be set in environment.") | ||
| return dims |
There was a problem hiding this comment.
Validate embedding dimensions as a positive integer at config boundary.
Line 101 currently accepts non-positive values (e.g., 0, -1), which are invalid for embedding/vector workflows and fail late downstream.
Suggested hardening patch
def get_embedding_dimensions(self) -> int:
"""Get embedding dimensions from CRAWLER_EMBEDDING_DIMENSIONS or EMBEDDING_DIMENSIONS."""
dims = self.embedding_dimensions
if dims is None:
raw = os.environ.get("EMBEDDING_DIMENSIONS")
if raw is not None:
- dims = int(raw)
+ try:
+ dims = int(raw)
+ except ValueError as exc:
+ raise ValueError("EMBEDDING_DIMENSIONS must be an integer.") from exc
- if dims is None:
+ if dims is None:
raise ValueError("EMBEDDING_DIMENSIONS must be set in environment.")
+ if dims <= 0:
+ raise ValueError("EMBEDDING_DIMENSIONS must be a positive integer.")
return dims🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/config.py` around lines 95 - 104, The
get_embedding_dimensions method currently permits non-positive integers; update
get_embedding_dimensions (and its use of self.embedding_dimensions and the
EMBEDDING_DIMENSIONS env var) to validate that the resolved value is an integer
> 0: when reading from self.embedding_dimensions accept only positive ints, when
parsing os.environ["EMBEDDING_DIMENSIONS"] convert with int inside a try/except
and raise a clear ValueError for non-integer input, and if the value is <= 0
raise ValueError("EMBEDDING_DIMENSIONS must be a positive integer") so invalid
values fail fast at config boundary.
| LEFT JOIN ( | ||
| SELECT url, COUNT(*) AS chunks_count | ||
| FROM chunks | ||
| GROUP BY url | ||
| ) c ON c.url = wu.url | ||
| WHERE {where_clause} |
There was a problem hiding this comment.
Join chunk counts on both domain and url to avoid miscounted indexing state.
At Line 57, the join keys only on url, but chunk storage is domain-scoped. This can inflate/bleed chunks_count across sites and incorrectly set indexed.
Proposed fix
- LEFT JOIN (
- SELECT url, COUNT(*) AS chunks_count
- FROM chunks
- GROUP BY url
- ) c ON c.url = wu.url
+ LEFT JOIN (
+ SELECT domain, url, COUNT(*) AS chunks_count
+ FROM chunks
+ GROUP BY domain, url
+ ) c ON c.domain = wu.domain AND c.url = wu.url📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| LEFT JOIN ( | |
| SELECT url, COUNT(*) AS chunks_count | |
| FROM chunks | |
| GROUP BY url | |
| ) c ON c.url = wu.url | |
| WHERE {where_clause} | |
| LEFT JOIN ( | |
| SELECT domain, url, COUNT(*) AS chunks_count | |
| FROM chunks | |
| GROUP BY domain, url | |
| ) c ON c.domain = wu.domain AND c.url = wu.url | |
| WHERE {where_clause} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/pages.py` around lines 53 - 58, The LEFT JOIN
that aggregates chunk counts currently groups and joins only by url causing
counts to bleed across domains; update the chunks subquery (alias c) to SELECT
domain, url, COUNT(*) AS chunks_count and GROUP BY domain, url, and change the
JOIN condition to match both c.url = wu.url AND c.domain = wu.domain so chunk
counts are domain-scoped when joining to the work unit rows in pages.py.
| if isinstance(sd, str): | ||
| sd = json.loads(sd) | ||
| description = extract_meta_description(sd) |
There was a problem hiding this comment.
Don’t fail homepage ingestion on malformed structured_data JSON.
A bad JSON payload currently aborts the entire homepage update path. Parse defensively and continue with structured_data=None instead of dropping the crawl result.
Proposed fix
sd = page.get("structured_data")
if isinstance(sd, str):
- sd = json.loads(sd)
+ try:
+ sd = json.loads(sd)
+ except json.JSONDecodeError:
+ logger.warning(f"Invalid structured_data JSON for {domain}; ignoring")
+ sd = None
description = extract_meta_description(sd)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if isinstance(sd, str): | |
| sd = json.loads(sd) | |
| description = extract_meta_description(sd) | |
| if isinstance(sd, str): | |
| try: | |
| sd = json.loads(sd) | |
| except json.JSONDecodeError: | |
| logger.warning(f"Invalid structured_data JSON for {domain}; ignoring") | |
| sd = None | |
| description = extract_meta_description(sd) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/websites.py` around lines 53 - 55, The code
currently does json.loads(sd) and will raise on malformed JSON, aborting
homepage ingestion; wrap the string parse in a try/except (catch
json.JSONDecodeError and ValueError) around the json.loads call in the same
block that checks isinstance(sd, str), and on failure set sd = None (or
structured_data = None if that name is used downstream) and log a debug/warn
message so execution proceeds to call extract_meta_description(sd) with sd=None;
reference the sd variable and the call to extract_meta_description to locate
where to change behavior.
| await asyncio.gather(_crawl_homepage(), _discover_urls()) | ||
|
|
||
| await manager.update_last_scanned(domain) | ||
| await manager.update_scan_status(domain, "active") | ||
|
|
There was a problem hiding this comment.
Initialization status is forced to active even when background steps fail.
Both subtasks catch and log errors internally, but final status is always set to active. This masks failures and weakens retry/alert behavior.
Proposed fix
async def _initialize_website(domain: str, manager: PgWebsiteStoreManager):
@@
+ homepage_ok = False
+ discovery_ok = False
+
async def _crawl_homepage():
+ nonlocal homepage_ok
@@
await manager.update_website_metadata(
domain=domain,
title=title,
description=description,
page_count=1,
)
+ homepage_ok = True
except Exception:
logger.exception(f"Failed to crawl homepage for {domain}")
@@
async def _discover_urls():
+ nonlocal discovery_ok
@@
if discovered:
await site_store.save_discovered_urls(discovered)
logger.info(f"Discovered {len(discovered)} URLs for {domain}")
+ discovery_ok = True
except Exception:
logger.exception(f"URL discovery failed for {domain}")
@@
await manager.update_last_scanned(domain)
- await manager.update_scan_status(domain, "active")
+ await manager.update_scan_status(domain, "active" if (homepage_ok or discovery_ok) else "error")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/websites.py` around lines 90 - 94, The code
currently awaits asyncio.gather(_crawl_homepage(), _discover_urls()) then
unconditionally calls manager.update_scan_status(domain, "active"); change the
flow so failure in either background step results in a non-active status: have
_crawl_homepage and _discover_urls either return a success boolean or re-raise
on fatal error, call results = await asyncio.gather(_crawl_homepage(),
_discover_urls(), return_exceptions=True), inspect results for exceptions or any
False return values, and only call manager.update_scan_status(domain, "active")
(and update_last_scanned(domain)) when all tasks succeeded; otherwise call
manager.update_scan_status(domain, "failed") (or an appropriate error status) so
failures aren’t masked.
| export function ensureUrl(s: string) { | ||
| return s.startsWith('http://') || s.startsWith('https://') | ||
| ? s | ||
| : `https://${s}`; | ||
| } | ||
|
|
||
| function _scanIntervalToCron(interval: string): { | ||
| schedule: string; | ||
| timezone: string; | ||
| } { | ||
| // Default timezone to UTC; can be made configurable later | ||
| const timezone = 'UTC'; | ||
| switch (interval) { | ||
| case '60m': | ||
| return { schedule: '0 * * * *', timezone }; | ||
| case '6h': | ||
| return { schedule: '0 */6 * * *', timezone }; | ||
| case '12h': | ||
| return { schedule: '0 */12 * * *', timezone }; | ||
| case '1d': | ||
| return { schedule: '0 2 * * *', timezone }; // Daily at 02:00 UTC | ||
| case '5d': | ||
| return { schedule: '0 2 */5 * *', timezone }; // Every 5 days at 02:00 UTC | ||
| case '7d': | ||
| return { schedule: '0 2 */7 * *', timezone }; // Every 7 days at 02:00 UTC | ||
| case '30d': | ||
| return { schedule: '0 2 1 * *', timezone }; // Monthly on the 1st at 02:00 UTC | ||
| default: | ||
| // Fallback to daily at 02:00 UTC | ||
| return { schedule: '0 2 * * *', timezone }; | ||
| } | ||
| export function toWebsiteDomain(input: string): string { | ||
| return new URL(ensureUrl(input)).hostname; |
There was a problem hiding this comment.
Fix scheme detection to be case-insensitive before parsing.
Line 19 only matches lowercase schemes. Inputs like HTTPS://example.com become https://HTTPS://example.com, and Line 25 can resolve the hostname incorrectly (https), causing bad stored domains and failed dedupe.
🐛 Proposed fix
export function ensureUrl(s: string) {
- return s.startsWith('http://') || s.startsWith('https://')
- ? s
- : `https://${s}`;
+ const trimmed = s.trim();
+ return /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
}#!/bin/bash
# Verifies current behavior for mixed-case schemes.
node - <<'NODE'
function ensureUrl(s){return s.startsWith('http://')||s.startsWith('https://')?s:`https://${s}`;}
for (const input of ['https://example.com', 'HTTPS://example.com', 'HTTP://example.com']) {
console.log(input, '=>', new URL(ensureUrl(input)).hostname);
}
NODE🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/websites/create_website.ts` around lines 18 - 25,
ensureUrl currently only matches lowercase 'http://'/'https://' so inputs with
mixed-case schemes (e.g. 'HTTPS://...') get double-prefixed; update ensureUrl to
perform a case-insensitive check (e.g. test s against /^https?:\/\//i or compare
s.toLowerCase().startsWith(...)) and return the original string when a scheme
exists, otherwise prepend 'https://'. Keep toWebsiteDomain as-is (it should call
the fixed ensureUrl) so new URL(ensureUrl(input)).hostname yields the correct
stored domain.
| default: | ||
| return 21600; |
There was a problem hiding this comment.
Unknown scan intervals silently map to 6 hours.
This masks invalid input and can schedule unintended crawl frequency.
🎯 Proposed fix
case '30d':
return 2592000;
default:
- return 21600;
+ throw new Error(`Unsupported scan interval: ${interval}`);
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| default: | |
| return 21600; | |
| default: | |
| throw new Error(`Unsupported scan interval: ${interval}`); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/websites/internal_actions.ts` around lines 49 - 50,
The switch that handles scan interval values currently uses a silent default of
"return 21600", which masks invalid inputs; update the default branch (the
switch handling scan intervals where "default: return 21600") to validate
unknown values explicitly by either throwing a descriptive Error (e.g., "Invalid
scan interval: <value>") or returning a clearly handled sentinel
(null/undefined) and logging the invalid input so callers must handle it; ensure
the change surfaces invalid inputs instead of silently scheduling a 6-hour
interval.
| for (const website of websites) { | ||
| const lastSync = website.metadata?.lastStatusSyncAt; | ||
| if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) { | ||
| continue; | ||
| } | ||
|
|
||
| try { | ||
| const websiteInfo = await fetchWebsiteInfo(website.domain); | ||
|
|
||
| if (websiteInfo) { | ||
| await ctx.runMutation( | ||
| internal.websites.internal_mutations.patchWebsite, | ||
| { | ||
| websiteId: website._id, | ||
| metadata: { | ||
| ...website.metadata, | ||
| lastStatusSyncAt: now, | ||
| lastSyncError: undefined, | ||
| }, | ||
| status: websiteInfo.status, | ||
| pageCount: websiteInfo.page_count, | ||
| crawledPageCount: websiteInfo.crawled_count, | ||
| title: websiteInfo.title ?? undefined, | ||
| description: websiteInfo.description ?? undefined, | ||
| lastScannedAt: websiteInfo.last_scanned_at | ||
| ? new Date(websiteInfo.last_scanned_at).getTime() | ||
| : undefined, | ||
| }, | ||
| ); | ||
| } | ||
| } catch (error) { | ||
| const message = error instanceof Error ? error.message : String(error); | ||
| console.warn(`Failed to sync status for ${website.domain}: ${message}`); | ||
| await ctx.runMutation( | ||
| internal.websites.internal_mutations.patchWebsite, | ||
| { | ||
| websiteId: website._id, | ||
| metadata: { | ||
| ...website.metadata, | ||
| lastStatusSyncAt: now, | ||
| lastSyncError: message, | ||
| }, | ||
| }, | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Status sync is fully serial and can time out on larger tenants.
Each website does network I/O + mutation sequentially; this does not scale with action time budgets.
⚡ Proposed fix (bounded concurrency)
+const STATUS_SYNC_CONCURRENCY = 5;
@@
- for (const website of websites) {
- const lastSync = website.metadata?.lastStatusSyncAt;
- if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) {
- continue;
- }
- // existing sync logic...
- }
+ for (let i = 0; i < websites.length; i += STATUS_SYNC_CONCURRENCY) {
+ const batch = websites.slice(i, i + STATUS_SYNC_CONCURRENCY);
+ await Promise.allSettled(
+ batch.map(async (website) => {
+ const lastSync = website.metadata?.lastStatusSyncAt;
+ if (typeof lastSync === 'number' && now - lastSync < SYNC_INTERVAL_MS) return;
+ // existing sync logic...
+ }),
+ );
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/websites/internal_actions.ts` around lines 130 -
175, The loop over websites in internal_actions.ts runs fetchWebsiteInfo and
ctx.runMutation serially causing timeouts; change it to process eligible
websites concurrently with a bounded concurrency (worker pool) to avoid
unbounded parallelism: filter websites by lastStatusSyncAt/SYNC_INTERVAL_MS
using now, then schedule tasks that call fetchWebsiteInfo(website.domain) and on
success or error call
ctx.runMutation(internal.websites.internal_mutations.patchWebsite, ...) for that
website; implement a simple concurrency limiter (e.g., queue of promises with a
maxWorkers constant) to run at most N fetch+mutation tasks in parallel and await
completion of all tasks before finishing the action.
| offset: v.optional(v.number()), | ||
| limit: v.optional(v.number()), | ||
| }, | ||
| handler: async (_ctx, args) => { | ||
| const crawlerUrl = process.env.CRAWLER_URL || 'http://localhost:8002'; | ||
| try { | ||
| await fetch(`${crawlerUrl}/api/v1/websites/${args.domain}`, { | ||
| method: 'DELETE', | ||
| }); | ||
| } catch (e) { | ||
| console.warn('Failed to deregister website from crawler:', e); | ||
| const crawlerUrl = getCrawlerUrl(); | ||
| const offset = args.offset ?? 0; | ||
| const limit = args.limit ?? 100; |
There was a problem hiding this comment.
offset/limit are not bounded before proxying to crawler APIs.
Negative or very large values can trigger expensive downstream calls and unstable behavior.
🔒 Proposed bounds
- const offset = args.offset ?? 0;
- const limit = args.limit ?? 100;
+ const offset = Math.max(0, args.offset ?? 0);
+ const limit = Math.min(200, Math.max(1, args.limit ?? 100));
@@
- const limit = args.limit ?? 10;
+ const limit = Math.min(50, Math.max(1, args.limit ?? 10));Also applies to: 286-291
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/websites/internal_actions.ts` around lines 231 -
237, Validate and clamp the offset and limit before using them to call the
crawler: in the handler where offset and limit are set (variables offset and
limit in the async handler that calls getCrawlerUrl()), enforce offset >= 0 and
limit within a reasonable max (e.g., 1..1000) or throw a validation error; apply
the same fix to the other handler referenced around the second occurrence (the
handler near lines 286-291). Ensure you normalize incoming args (args.offset,
args.limit) and replace downstream uses so only the validated/clamped values are
proxied to the crawler APIs.
| pageCount: v.optional(v.number()), | ||
| crawledPageCount: v.optional(v.number()), | ||
| metadata: v.optional(jsonRecordValidator), | ||
| }, | ||
| handler: async (ctx, args) => { | ||
| return await WebsitesHelpers.updateWebsite(ctx, args); | ||
| }, |
There was a problem hiding this comment.
Validate pageCount and crawledPageCount as non-negative before update.
These new fields currently accept negative values, which can produce invalid website stats.
Proposed fix
handler: async (ctx, args) => {
+ if (args.pageCount !== undefined && args.pageCount < 0) {
+ throw new Error('pageCount must be non-negative');
+ }
+ if (args.crawledPageCount !== undefined && args.crawledPageCount < 0) {
+ throw new Error('crawledPageCount must be non-negative');
+ }
return await WebsitesHelpers.updateWebsite(ctx, args);
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pageCount: v.optional(v.number()), | |
| crawledPageCount: v.optional(v.number()), | |
| metadata: v.optional(jsonRecordValidator), | |
| }, | |
| handler: async (ctx, args) => { | |
| return await WebsitesHelpers.updateWebsite(ctx, args); | |
| }, | |
| pageCount: v.optional(v.number()), | |
| crawledPageCount: v.optional(v.number()), | |
| metadata: v.optional(jsonRecordValidator), | |
| }, | |
| handler: async (ctx, args) => { | |
| if (args.pageCount !== undefined && args.pageCount < 0) { | |
| throw new Error('pageCount must be non-negative'); | |
| } | |
| if (args.crawledPageCount !== undefined && args.crawledPageCount < 0) { | |
| throw new Error('crawledPageCount must be non-negative'); | |
| } | |
| return await WebsitesHelpers.updateWebsite(ctx, args); | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/websites/internal_mutations.ts` around lines 48 -
54, The pageCount and crawledPageCount fields currently allow negatives; update
their validation so they must be non-negative before calling
WebsitesHelpers.updateWebsite. In the mutation schema (where pageCount and
crawledPageCount are defined) change their validators to enforce >=0 (e.g.,
replace v.number() with a non-negative number validator or add .min(0) / integer
check), or add an explicit pre-check in the handler to throw or clamp negatives
before invoking WebsitesHelpers.updateWebsite; ensure the checks reference the
same field names pageCount and crawledPageCount so invalid values are
rejected/normalized prior to the update.
…ex reliability The embedding column starts as untyped `vector` since dimensions are configurable. This caused HNSW index creation to fail on empty tables. Now init_pool pins the column to an explicit dimension via ALTER TABLE before creating the index, and re-pins if the configured dimension changes while the table is empty. A SQL-side guard also rejects bare `vector` columns to prevent silent index failures.
- Sort imports in crawler models.py (ruff I001) - Add strict=True to zip() in scheduler.py (ruff B905) - Remove invalid padding/gap props from BorderedSection usage - Remove unused @ai-sdk/provider dependency
…rs in queries URLs and other inputs containing colons, parentheses, or slashes caused Tantivy query parser errors. Switch from raw string @@@ to structured paradedb.match() which bypasses the string parser entirely.
Summary
semantic-text-splitterMarkdownSplitter for structure-aware content splittingChanges
search,pages,index), services (chunking,embedding,indexing,search,database,pg_website_store), and config moduletale_searchdatabase with pgvector, drop legacy TimescaleDBwebsite_page_embeddings/module andwebsite_pagesworkflow action; website operations now delegate to crawler via internal actionsTest plan
test_chunking_service,test_embedding_service,test_indexing_service,test_search_service,test_database,test_*_router)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores