ci: add CI workflows for all services and standardize build triggers #9

Merged: larryro merged 1 commit into main from built-images on Dec 10, 2025

@larryro larryro commented Dec 10, 2025

  • Add build-and-push workflows for crawler, platform, proxy, and search services
  • Enable automatic builds on main branch with path-based triggers for all services
  • Update existing db, graph-db, and rag workflows to use consistent path triggers
  • Update db workflow to use new services/db path for Dockerfile
  • Remove obsolete build-platform.yml in favor of new build-and-push-platform.yml
  • Update docs to reflect SITE_URL usage instead of NEXT_PUBLIC_APP_URL

Summary by CodeRabbit

  • Chores

    • Added automated Docker image build and push workflows for Crawler, Platform, Proxy, and Search services with multi-architecture support (linux/amd64 and linux/arm64).
    • Enhanced existing CI/CD workflows (DB, Graph DB, RAG) to trigger on file changes in addition to tags.
    • Removed legacy Platform build workflow.
  • Documentation

    • Updated URL configuration guidance with runtime-based derivation approach.
    • Clarified environment variable usage (SITE_URL, DOMAIN) for OAuth2 and email provider setup.

@larryro larryro merged commit 89ef494 into main Dec 10, 2025
1 check was pending
@larryro larryro deleted the built-images branch December 10, 2025 07:22
coderabbitai Bot commented Dec 10, 2025

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This PR introduces a standardized multi-architecture Docker build-and-push workflow infrastructure across multiple services. It adds four new GitHub Actions workflows (crawler, platform, proxy, search) that build and push images to GHCR with multi-platform support (linux/amd64, linux/arm64) and automated testing. Simultaneously, existing workflows (db, graph-db, rag) are updated to include file-path-based triggers in addition to tag-based triggers. The legacy build-platform.yml workflow is removed in favor of the new standardized approach. Documentation is updated to reflect runtime URL derivation using SITE_URL instead of NEXT_PUBLIC_* variables.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

  • Workflow consistency: Verify that all four new build-and-push workflows (crawler, platform, proxy, search) follow identical patterns for metadata extraction, build contexts, and test job logic
  • File path accuracy: Confirm Dockerfile locations and build contexts match actual service structures (services/crawler/Dockerfile, services/platform/Dockerfile, services/proxy/Dockerfile, services/search/Dockerfile)
  • Trigger patterns: Validate that push triggers and tag patterns (crawler-v*, platform-v*, proxy-v*, search-v*) are correctly specified across workflows
  • Test job implementations: Review differences in test approaches (Python import check vs. caddy version vs. health endpoint checks)
  • Documentation accuracy: Validate the URL derivation logic described in docs/url-configuration.md and environment variable mappings in docs/email-providers.md

Possibly related PRs

  • talecorp/poc2#35: Introduces the services/platform component that the new build-and-push-platform.yml workflow targets and orchestrates
  • tale-project/poc2#328: Directly removes/replaces build-platform.yml, the legacy workflow this PR supersedes with standardized multi-architecture workflows
  • tale-project/poc2#316: Updates environment variable naming conventions across workflows and service configurations, overlapping with the SITE_URL/DOMAIN refactoring in this PR

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between b6a7af2 and 540fa4b.

📒 Files selected for processing (10)
  • .github/workflows/build-and-push-crawler.yml (1 hunks)
  • .github/workflows/build-and-push-db.yml (2 hunks)
  • .github/workflows/build-and-push-graph-db.yml (1 hunks)
  • .github/workflows/build-and-push-platform.yml (1 hunks)
  • .github/workflows/build-and-push-proxy.yml (1 hunks)
  • .github/workflows/build-and-push-rag.yml (1 hunks)
  • .github/workflows/build-and-push-search.yml (1 hunks)
  • .github/workflows/build-platform.yml (0 hunks)
  • docs/email-providers.md (2 hunks)
  • docs/url-configuration.md (8 hunks)

larryro added a commit that referenced this pull request Dec 30, 2025
The function now handles both seconds and milliseconds timestamps using
a heuristic: timestamps < 1e11 are treated as seconds and converted to
milliseconds. This prevents silent miscalculations when metadata contains
seconds-based timestamps from sources like RAG indexing.
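
A minimal sketch of the heuristic described above, written in Python for illustration only. The function name `to_milliseconds` is an assumption; the actual function in the repository is not shown on this page.

```python
# Cutoff taken from the commit message: values below 1e11 are treated as seconds.
SECONDS_THRESHOLD = 1e11


def to_milliseconds(timestamp: float) -> float:
    """Return the timestamp in milliseconds, converting it if it looks like seconds."""
    if timestamp < SECONDS_THRESHOLD:
        return timestamp * 1000  # seconds -> milliseconds
    return timestamp  # already milliseconds
```

Note that 1e11 milliseconds falls in March 1973, so a genuine millisecond timestamp older than that would be misread as seconds; the heuristic only suits data with recent timestamps.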

Addresses CodeRabbit review comment #9.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
larryro added a commit that referenced this pull request May 7, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.
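
The ownership check can be sketched as follows. This is an illustrative Python model, not the Convex TypeScript code: the in-memory `file_metadata` list and the `storageId` / `organizationId` field names are assumptions made for the example.

```python
def verify_storage_ids_belong_to_org(file_metadata, org_id, storage_ids):
    """Return True only if every storage ID is owned by org_id (hypothetical field names)."""
    owned = {row["storageId"] for row in file_metadata if row["organizationId"] == org_id}
    return all(storage_id in owned for storage_id in storage_ids)


def compare_documents(file_metadata, caller_org_id, base_storage_id, comparison_storage_id):
    """Refuse to hand out either file unless both belong to the caller's org."""
    ids = [base_storage_id, comparison_storage_id]
    if not verify_storage_ids_belong_to_org(file_metadata, caller_org_id, ids):
        raise PermissionError("storage IDs do not belong to the caller's organization")
    return base_storage_id, comparison_storage_id  # safe to resolve URLs from here on
```

The point of the join is that a storage ID alone carries no tenant information, so the check has to go through a table that does.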

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.
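
A rough Python model of the two gates described above (authentication first, then filtering by the caller's orgs). All parameter and field names here are hypothetical; the real code is a Convex action and internal query.

```python
def filter_storage_ids_for_caller(auth_user, caller_org_ids, file_metadata, storage_ids):
    """Keep only storage IDs owned by an org the authenticated caller belongs to."""
    if auth_user is None:
        raise PermissionError("authentication required")
    owner_by_storage_id = {row["storageId"]: row["organizationId"] for row in file_metadata}
    # Unknown IDs map to None, which is never a member org, so they are dropped too.
    return [sid for sid in storage_ids if owner_by_storage_id.get(sid) in caller_org_ids]
```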

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.
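
The allow-list behavior can be sketched like this. The policy type names come from the commit message; the function signatures, the dict shape of a policy, and the choice to raise on a denied `get_policy` are assumptions for illustration.

```python
# Policy types that ordinary members may read, as listed in the commit message.
MEMBER_READABLE = frozenset({
    "data_classification_notice",
    "feature_flags",
    "pii_config",
    "chat_filter",
    "personalization",
    "upload_policy",
    "default_models",
})


def list_policies(policies, is_admin):
    """Admins see everything; members silently get only allow-listed types."""
    if is_admin:
        return list(policies)
    return [policy for policy in policies if policy["type"] in MEMBER_READABLE]


def get_policy(policies, policy_type, is_admin):
    """Return one policy, refusing admin-only types for non-admins (assumed behavior)."""
    if not is_admin and policy_type not in MEMBER_READABLE:
        raise PermissionError(f"{policy_type} is admin-only")
    return next((p for p in policies if p["type"] == policy_type), None)
```

An allow-list fails closed: a policy type added later stays admin-only until someone deliberately adds it to the set.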

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, and the FastAPI router was mounted but
  unreachable from platform: a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 7, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.
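
For reference, a constant-time comparison with the standard library looks roughly like this. The helper name `token_matches` is an assumption; only `hmac.compare_digest` is confirmed by the commit message.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens without leaking the mismatch position through timing."""
    # Encoding to bytes avoids the ASCII-only restriction compare_digest places on str.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A plain `==` can return as soon as the first differing byte is found, which is what makes it unsuitable for secrets.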

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.
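
The W7 #2 change can be pictured as a small helper that builds the documentation-related keyword arguments for the FastAPI constructor. This is a sketch only: the helper name `docs_kwargs` is hypothetical, and the paths shown are FastAPI's defaults rather than values confirmed by the commit.

```python
def docs_kwargs(debug: bool) -> dict:
    """Keyword arguments for FastAPI(...): docs endpoints are exposed only in debug."""
    if debug:
        return {
            "docs_url": "/docs",
            "redoc_url": "/redoc",
            "openapi_url": "/openapi.json",
        }
    # Passing None disables the corresponding endpoint entirely.
    return {"docs_url": None, "redoc_url": None, "openapi_url": None}
```

It would be used as `FastAPI(**docs_kwargs(debug))`, where `debug` comes from whatever settings object the service uses.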

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.
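
The effect of accepting both names can be approximated with the standard library alone. This is not the pydantic implementation; the precedence order, the set of truthy strings, and the default of `False` are all assumptions.

```python
import os

# Both environment variable names from the commit message, checked in this order.
ENV_NAMES = ("RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN", "TALE_REQUIRE_CUSTOM_RAG_TOKEN")
TRUTHY = frozenset({"1", "true", "yes", "on"})  # assumed parsing rule


def require_custom_internal_token(environ=None) -> bool:
    """Return the strict-mode flag from whichever supported variable is set."""
    env = os.environ if environ is None else environ
    for name in ENV_NAMES:
        value = env.get(name)
        if value is not None:
            return value.strip().lower() in TRUTHY
    return False  # assumed default when neither variable is present
```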

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).
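
Two of the guards above (F2 and F3) are easy to illustrate with Python's `ipaddress` module. The original is TypeScript; the function names and the single-entry hostname blocklist below are hypothetical.

```python
import ipaddress

IPV6_ULA = ipaddress.ip_network("fc00::/7")  # unique local addresses (F2)
BLOCKED_HOSTNAMES = frozenset({"localhost"})  # hypothetical blocklist entry


def normalize_hostname(hostname: str) -> str:
    """Strip the trailing dot of a fully qualified name before lookup (F3)."""
    return hostname.rstrip(".").lower()


def is_blocked_host(hostname: str) -> bool:
    """Return True for blocklisted names and for IPv6 literals inside fc00::/7."""
    host = normalize_hostname(hostname)
    if host in BLOCKED_HOSTNAMES:
        return True
    try:
        address = ipaddress.ip_address(host.strip("[]"))
    except ValueError:
        return False  # not an IP literal; name-based rules already applied
    return address.version == 6 and address in IPV6_ULA
```

Without the trailing-dot normalization, `localhost.` would resolve to the same host while missing an exact-match blocklist. The AWS IPv6 metadata endpoint `fd00:ec2::254` sits inside `fc00::/7`, which is why the range matters here.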

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
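
A defensive pagination loop along these lines might look as follows. It is a Python sketch of the idea, not the TypeScript helper: `fetch_page` and the `chunks` / `next_cursor` keys are hypothetical, while `MAX_ITERATIONS=30` and the 60 000 ms timeout come from the commit message.

```python
MAX_ITERATIONS = 30  # hard cap from the commit message


def fetch_all_chunks(fetch_page, timeout_ms=60_000):
    """Collect pages until the cursor ends, stalls, or the iteration cap is hit."""
    chunks = []
    cursor = None
    for _ in range(MAX_ITERATIONS):
        page = fetch_page(cursor, timeout_ms)
        chunks.extend(page["chunks"])
        next_cursor = page.get("next_cursor")
        if next_cursor is None:
            break  # normal end of results
        if next_cursor == cursor:
            break  # cursor did not advance: stop instead of looping forever
        cursor = next_cursor
    return chunks
```

The cap bounds the work even if the server keeps inventing fresh cursors, and the stall check catches the cheaper case of a repeated cursor.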
larryro added a commit that referenced this pull request May 9, 2026
… 2FA pepper

P0-16 — `scrubSubjectAuditLogs` doesn't clear `actorEmailHash` /
`actorIpHash` (round-1 #8, round-2 V6).
  Peppered hashes are pseudonymized PII per GDPR Art 4(5) — they're
  still personal data and must be cleared on Art 17 erasure. Without
  this, a subject's audit-chain entries kept a stable identifier even
  after `scrubSubjectAuditLogs` "scrubbed" them; re-identification was
  possible by the controller (or anyone with the pepper) by hashing a
  known email. The signed `pii_scrub` checkpoint window already permits
  the row's hash to diverge from its original (verifier skips chain
  re-compute inside the window), so clearing these columns is
  chain-safe — just two added field clears in the patch.
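
To see why a peppered hash remains a stable identifier, consider this sketch. The hashing scheme (HMAC-SHA256 over a lowercased value) and the helper names are assumptions; the commit message only confirms the `actorEmailHash` and `actorIpHash` field names.

```python
import hashlib
import hmac


def peppered_hash(value: str, pepper: bytes) -> str:
    """Keyed hash of a normalized value (HMAC-SHA256 is an assumed scheme)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(pepper, normalized, hashlib.sha256).hexdigest()


def scrub_subject_row(row: dict) -> dict:
    """Return a copy of the audit row with both pseudonymous identifiers cleared."""
    scrubbed = dict(row)
    scrubbed["actorEmailHash"] = None
    scrubbed["actorIpHash"] = None
    return scrubbed
```

Anyone holding the pepper can recompute `peppered_hash` for a known email and match it against stored rows, so erasure has to clear the hash columns as well.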

P0-17 — `notifications` table has no retention or erasure coverage
(round-1 #8, round-2 V6).
  In-app notifications carry the subject's peppered email + IP in
  `params` (lockout alerts, system messages). Without retention they
  accumulated indefinitely AND survived subject erasure.
  Fix:
   - Added `'notifications'` to `RETENTION_CATEGORIES` + policy schema
     fields. Wired into bounds-proposal map + bounds validator + clamp.
   - New `cleanupNotifications` action category: hard-delete on TTL
     (no two-pass trash — admin telemetry isn't user-restorable),
     gated by org-wide hold only.
   - New `listExpiredNotifications` query + `deleteExpiredNotification`
     mutation (cross-org guard + mid-flight org-hold re-check).
   - New `eraseSubjectNotifications` for Art 17 cascade: matches
     params.email against plaintext OR peppered-hash form so rows
     written under either pepper state are covered. Wired into
     `processErasureRequest`.
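
The "plaintext OR peppered-hash" match in the last item could be sketched as below. The hashing scheme, the function names, and the list-based storage are assumptions for illustration; only `params.email` is taken from the commit message.

```python
import hashlib
import hmac
from typing import Optional


def peppered_hash(value: str, pepper: bytes) -> str:
    """Keyed hash of a normalized value (HMAC-SHA256 is an assumed scheme)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(pepper, normalized, hashlib.sha256).hexdigest()


def matches_subject(params: dict, subject_email: str, pepper: Optional[bytes]) -> bool:
    """True if params['email'] holds the subject's email in plaintext or hashed form."""
    stored = params.get("email")
    if stored is None:
        return False
    candidates = {subject_email.strip().lower()}
    if pepper is not None:
        candidates.add(peppered_hash(subject_email, pepper))
    return stored.strip().lower() in candidates


def erase_subject_notifications(notifications, subject_email, pepper):
    """Return the notifications that survive erasure for the given subject."""
    return [
        n for n in notifications
        if not matches_subject(n.get("params", {}), subject_email, pepper)
    ]
```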

P1-F — 2FA writes plaintext email/IP to audit chain (round-1 #9,
round-2 V6). Switched 2FA's recordFailure / clearOnSuccess /
logEnrollmentEvent to splitEmailForAudit / splitIpForAudit shape;
matches login_attempts so a single TALE_AUDIT_PEPPER env-var flip
rotates the whole chain.

Verified: typecheck clean; 599 tests pass across affected dirs.
larryro added a commit that referenced this pull request May 9, 2026
…heck

- audit_hash: add lifecycleStatus + statusChangedAt to EXCLUDED_FIELDS so
  retention soft-delete (markRowExpiredGeneric) patching audit log rows
  doesn't poison the chain hash recompute. Pre-fix, ANY soft-deleted audit
  row caused verifyIntegrity to fail valid=false from that row forward.
  Round-2 review CRITICAL #8.

- audit_logs/validators: declare lifecycleStatus + statusChangedAt on
  auditLogItemValidator so query-return validation accepts soft-deleted
  rows. Defense-in-depth alongside the EXCLUDED_FIELDS fix.

- verify_integrity: anchor candidate filter accepts subtype === undefined
  (legacy retention checkpoints written before subtype field existed).
  Strict equality dropped them, breaking verifyIntegrity for any
  deployment that ran retention pre-upgrade. Match canonicalCheckpointPayload's
  `?? 'retention'` fallback. Round-2 review CRITICAL #9.

- verify_integrity: add fromTimestamp arg for paged resumption + suppress
  isFirstEntry head-anchor when paging mid-chain. Pre-fix, the response
  promised "page from lastVerifiedTimestamp + 1" but the query had no
  such arg — large-org chains could not be paged.

- verify_integrity: drop unsignedScrubSubjects Set (security-flavored dead
  code; unsignedScrubCount alone tracks the metric). The set was populated
  but never read; the actual gate is `!hasSigningKey`. Comment clarified.

- verify_integrity: type entries as Doc<'auditLogs'>[] instead of an
  open `[key: string]: unknown` index signature.

- audit_logs/internal_mutations: delete archiveOldLogs deprecated
  re-export — zero callers, AGENTS.md prohibits @deprecated tombstones.

- audit_logs/helpers (createAuditLog): introduce buildAuditRecordHashInput
  as single source of truth for the canonical record payload — both
  writer and self-check call it, eliminating drift risk that schema
  additions could change the hash output across writes vs verify.

- audit_logs/helpers (createAuditLog): genesis sentinel — read + patch
  the per-org auditLogChainGenesis row before each write. This forces
  OCC contention on a real document for the first audit write per org,
  closing the genesis-fork race where two concurrent first-writers both
  observe lastEntry=null and commit two roots with previousHash=''.
  Round-2 review CRITICAL #10.

- audit_logs/helpers (createAuditLog): inline self-check on every write
  recomputes the prior row's integrity hash and console.errors on
  mismatch. Catches naive scenario-1 tampering (field changed, hash not
  updated) at the next legitimate audit write — the only automated tamper
  detection today. Wrapped in try/catch and skips piiScrubbed rows so
  it cannot affect the legitimate write path. Round-2 review C.5.
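
Two of the fixes above lend themselves to a short illustration: excluding lifecycle fields from the hashed payload, and treating a missing checkpoint subtype as `'retention'`. This Python sketch assumes a JSON-canonicalized SHA-256 hash and invents the function names; the real `EXCLUDED_FIELDS` set presumably contains further entries not shown on this page.

```python
import hashlib
import json

# Only the two fields added by this commit; the real set is not shown here.
EXCLUDED_FIELDS = frozenset({"lifecycleStatus", "statusChangedAt"})


def record_hash(row: dict, previous_hash: str) -> str:
    """Hash the row chained to previous_hash, ignoring excluded lifecycle fields."""
    payload = {key: value for key, value in row.items() if key not in EXCLUDED_FIELDS}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + canonical).encode("utf-8")).hexdigest()


def checkpoint_subtype(checkpoint: dict) -> str:
    """Mirror the `?? 'retention'` fallback for checkpoints written before subtype existed."""
    subtype = checkpoint.get("subtype")
    return "retention" if subtype is None else subtype
```

With the exclusion in place, a soft-delete that only patches `lifecycleStatus` and `statusChangedAt` leaves the recomputed hash unchanged, while any edit to a hashed field still changes it.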

Lint + typecheck clean. Convex codegen succeeded.
larryro added a commit that referenced this pull request May 9, 2026
…cies

Bundle of round-2-confirmed cross-tenant fixes plus the dead-code
delete of the semantic LLM response cache.

POLICY_TYPES drift (W6 #5)
- lib/shared/schemas/governance.ts now includes
  'data_classification_notice' to match the Convex enum, killing the
  `as const` cast at use-data-classification-notice.ts:50.

documents/compare_documents.ts (W6 #8)
- Convex `_storage` is a global namespace; org membership alone was
  not enough to gate `ctx.storage.getUrl`. Adds a JOIN through
  fileMetadata via the new internal query verifyStorageIdsBelongToOrg
  to confirm both `baseStorageId` and `comparisonStorageId` are owned
  by the caller's org. Refuses with a clear error otherwise. Pattern
  copied from agent_tools/documents/helpers/retrieve_document.ts.

file_metadata/actions.ts::checkFileRagStatuses (W6 #9)
- Was an unauthenticated public action that could flip any org's
  fileMetadata.ragStatus to `failed` via expireStaleRagQueue (DoS,
  pre-existing on `main`). Now requires `getAuthUser` and filters
  storageIds to ones owned by an org the caller is a member of via
  the new file_metadata.internal_queries.filterStorageIdsByCallerOrg.

governance/queries.ts (W6 #11)
- getPolicy + listPolicies now apply a member-readable allow-list
  (data_classification_notice, feature_flags, pii_config,
  chat_filter, personalization, upload_policy, default_models). All
  other types — login_policy.trustedProxies, password_policy,
  two_factor_policy, model_access.rules, budgets, retention_policy,
  moderation_provider.endpoint, system_prompt — are admin-only.
  listPolicies silently filters those out for non-admins.

semantic LLM response cache — DELETE (W6 #12 + #13)
- Round-2 v05 confirmed the lookup is structurally cross-tenant
  (filters only on agent_name, model, expires_at, similarity; ignores
  user_id / organization_id even though they're stored). The platform
  helpers `lookupSemanticCache` / `storeSemanticCacheAsync` had ZERO
  callers in the monorepo, the FastAPI router was mounted but
  unreachable from platform — a latent foot-gun primed for the next
  dev to wire up unaware. Deletes:

  - services/platform/convex/lib/response_cache/semantic_cache.ts
  - services/platform/convex/lib/response_cache/internal_actions.ts
  - services/rag/app/routers/llm_cache.py
  - services/rag/app/services/llm_response_cache.py

  Plus the corresponding imports in routers/__init__.py, main.py,
  rag_service.py. Also removes the two empty-catch violations in
  semantic_cache.ts (no longer applicable).

The exact-key Convex `lib/response_cache/{internal_mutations,
internal_queries}.ts` cache stays — it is the actually-wired one and
is correctly org-scoped.
larryro added a commit that referenced this pull request May 9, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
larryro added a commit that referenced this pull request May 9, 2026
… 2FA pepper

P0-16 — `scrubSubjectAuditLogs` doesn't clear `actorEmailHash` /
`actorIpHash` (round-1 #8, round-2 V6).
  Peppered hashes are pseudonymized PII per GDPR Art 4(5) — they're
  still personal data and must be cleared on Art 17 erasure. Without
  this, a subject's audit-chain entries kept a stable identifier even
  after `scrubSubjectAuditLogs` "scrubbed" them; re-identification was
  possible by the controller (or anyone with the pepper) by hashing a
  known email. The signed `pii_scrub` checkpoint window already permits
  the row's hash to diverge from its original (verifier skips chain
  re-compute inside the window), so clearing these columns is
  chain-safe — just two added field clears in the patch.

P0-17 — `notifications` table has no retention or erasure coverage
(round-1 #8, round-2 V6).
  In-app notifications carry the subject's peppered email + IP in
  `params` (lockout alerts, system messages). Without retention they
  accumulated indefinitely AND survived subject erasure.
  Fix:
   - Added `'notifications'` to `RETENTION_CATEGORIES` + policy schema
     fields. Wired into bounds-proposal map + bounds validator + clamp.
   - New `cleanupNotifications` action category: hard-delete on TTL
     (no two-pass trash — admin telemetry isn't user-restorable),
     gated by org-wide hold only.
   - New `listExpiredNotifications` query + `deleteExpiredNotification`
     mutation (cross-org guard + mid-flight org-hold re-check).
   - New `eraseSubjectNotifications` for Art 17 cascade: matches
     params.email against plaintext OR peppered-hash form so rows
     written under either pepper state are covered. Wired into
     `processErasureRequest`.
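
The "plaintext OR peppered-hash" match in `eraseSubjectNotifications` can be sketched as below. The row shape, the `HashEmail` type, and both function names are hypothetical; the hash is injected as a parameter because the real peppered-hash function is not shown in the commit message.

```typescript
// Hypothetical row shape; only `params.email` is taken from the commit message.
interface NotificationRow {
  id: string;
  params: { email?: string };
}

// Stand-in for the peppered hash function.
type HashEmail = (email: string) => string;

/** A row belongs to the subject if it stores the email in either form. */
function matchesSubject(
  row: NotificationRow,
  subjectEmail: string,
  hashEmail: HashEmail,
): boolean {
  const stored = row.params.email;
  if (stored === undefined) return false;
  return stored === subjectEmail || stored === hashEmail(subjectEmail);
}

/** Select every row that the Art 17 cascade should delete for this subject. */
function selectRowsToErase(
  rows: NotificationRow[],
  subjectEmail: string,
  hashEmail: HashEmail,
): NotificationRow[] {
  return rows.filter((row) => matchesSubject(row, subjectEmail, hashEmail));
}
```

Checking both forms means rows written before and after the pepper was configured are selected in the same pass.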

P1-F — 2FA writes plaintext email/IP to audit chain (round-1 #9,
round-2 V6). Switched 2FA's recordFailure / clearOnSuccess /
logEnrollmentEvent to splitEmailForAudit / splitIpForAudit shape;
matches login_attempts so a single TALE_AUDIT_PEPPER env-var flip
rotates the whole chain.

Verified: typecheck clean; 599 tests pass across affected dirs.
larryro added a commit that referenced this pull request May 9, 2026
…heck

- audit_hash: add lifecycleStatus + statusChangedAt to EXCLUDED_FIELDS so
  retention soft-delete (markRowExpiredGeneric) patching audit log rows
  doesn't poison the chain hash recompute. Pre-fix, ANY soft-deleted audit
  row caused verifyIntegrity to return valid=false from that row forward.
  Round-2 review CRITICAL #8.

- audit_logs/validators: declare lifecycleStatus + statusChangedAt on
  auditLogItemValidator so query-return validation accepts soft-deleted
  rows. Defense-in-depth alongside the EXCLUDED_FIELDS fix.

- verify_integrity: anchor candidate filter accepts subtype === undefined
  (legacy retention checkpoints written before subtype field existed).
  Strict equality dropped them, breaking verifyIntegrity for any
  deployment that ran retention pre-upgrade. Match canonicalCheckpointPayload's
  `?? 'retention'` fallback. Round-2 review CRITICAL #9.

- verify_integrity: add fromTimestamp arg for paged resumption + suppress
  isFirstEntry head-anchor when paging mid-chain. Pre-fix, the response
  promised "page from lastVerifiedTimestamp + 1" but the query had no
  such arg — large-org chains could not be paged.

- verify_integrity: drop unsignedScrubSubjects Set (security-flavored dead
  code; unsignedScrubCount alone tracks the metric). The set was populated
  but never read; the actual gate is `!hasSigningKey`. Comment clarified.

- verify_integrity: type entries as Doc<'auditLogs'>[] instead of an
  open `[key: string]: unknown` index signature.

- audit_logs/internal_mutations: delete archiveOldLogs deprecated
  re-export — zero callers, AGENTS.md prohibits @deprecated tombstones.

- audit_logs/helpers (createAuditLog): introduce buildAuditRecordHashInput
  as single source of truth for the canonical record payload — both
  writer and self-check call it, eliminating drift risk that schema
  additions could change the hash output across writes vs verify.

- audit_logs/helpers (createAuditLog): genesis sentinel — read + patch
  the per-org auditLogChainGenesis row before each write. This forces
  OCC contention on a real document for the first audit write per org,
  closing the genesis-fork race where two concurrent first-writers both
  observe lastEntry=null and commit two roots with previousHash=''.
  Round-2 review CRITICAL #10.

- audit_logs/helpers (createAuditLog): inline self-check on every write
  recomputes the prior row's integrity hash and console.errors on
  mismatch. Catches naive scenario-1 tampering (field changed, hash not
  updated) at the next legitimate audit write — the only automated tamper
  detection today. Wrapped in try/catch and skips piiScrubbed rows so
  it cannot affect the legitimate write path. Round-2 review C.5.
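
To illustrate why the EXCLUDED_FIELDS change in the first item keeps the chain verifiable, here is a small sketch. The canonicalization scheme, the `integrityHash` entry, and the FNV-1a hash are assumptions chosen to keep the example self-contained; the real code presumably uses a cryptographic hash and its own payload format.

```typescript
// Only lifecycleStatus and statusChangedAt come from the commit message;
// integrityHash is an assumed additional entry.
const EXCLUDED_FIELDS = new Set(["integrityHash", "lifecycleStatus", "statusChangedAt"]);

type AuditRow = Record<string, unknown>;

/** Serialize the row with excluded fields dropped and keys in sorted order. */
function canonicalPayload(row: AuditRow): string {
  const kept = Object.keys(row)
    .filter((key) => !EXCLUDED_FIELDS.has(key))
    .sort()
    .map((key) => [key, row[key]] as const);
  return JSON.stringify(kept);
}

/** Non-cryptographic FNV-1a, used here only as a stand-in hash. */
function standInHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

/** Chain hash of a row: depends on the previous hash and the canonical payload. */
function rowHash(row: AuditRow, previousHash: string): string {
  return standInHash(previousHash + "|" + canonicalPayload(row));
}
```

With this scheme, patching `lifecycleStatus` or `statusChangedAt` on an existing row leaves `rowHash` unchanged, while editing any other field changes the canonical payload.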

Lint + typecheck clean. Convex codegen succeeded.
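
The legacy-checkpoint fix to the verify_integrity anchor filter in this commit comes down to one comparison. The sketch below assumes a checkpoint shape with an optional `subtype`; the `Checkpoint` interface, the `pii_scrub` subtype value, and the function names are illustrative only.

```typescript
// Hypothetical checkpoint shape; legacy rows predate the subtype field.
interface Checkpoint {
  timestamp: number;
  subtype?: "retention" | "pii_scrub";
}

/** Pre-fix behavior: strict equality silently drops legacy checkpoints. */
function isRetentionAnchorStrict(cp: Checkpoint): boolean {
  return cp.subtype === "retention";
}

/** Post-fix behavior: a missing subtype falls back to 'retention'. */
function isRetentionAnchor(cp: Checkpoint): boolean {
  return (cp.subtype ?? "retention") === "retention";
}
```

A checkpoint written before the field existed, such as `{ timestamp: 1 }`, is rejected by the strict form and accepted by the fallback form.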
larryro added a commit that referenced this pull request May 17, 2026
Closes #9, #10, #11, #12 — cascade correctness + GC durability.

- `personalization_cascade.ts:cascadeOnOrgDeleted` swaps delete order:
  `db.delete` runs FIRST, then `storage.delete` inside the try/catch.
  Matches the documented contract in `tts/cascade_helpers.ts:55-62` —
  Convex `_storage` writes are out-of-band and not rolled back on tx
  abort, so the reverse order leaves a surviving row pointing at a
  dead storageId (404 on `/api/tts-audio`).
- `threads/cascade_helpers.ts` step 7c TTS cleanup gets the same swap,
  for the same reason.
- `cascadeOnTtsForMemberRemoved` per-mutation page cap lowered from
  50 (~10K writes) to 30 (~6K writes) to stay under Convex's ~8K
  per-mutation write budget. `cascadeOnOrgDeleted` gets the same cap
  reduction. The hourly cron picks up whatever doesn't fit in a single
  pass — still well inside the 30-day Art 12(3) GDPR window.
- `gcOrgTtsChunks` now persists its org-cursor in a new singleton
  `ttsGcCursor` table between cron runs. A deployment with more orgs
  than `MAX_ORGS_PER_RUN` now advances through the full org list over
  successive hours instead of restarting from the lex-first org every
  time and starving lex-tail orgs forever. On reaching the end of the
  org list the cursor wraps to null and the next run starts over.
- `gcOrgTtsChunks` skip-empty: an org with no rows older than the
  retention cutoff no longer counts against `MAX_ORGS_PER_RUN`. Without
  this, a busy tail of stale orgs queued behind quiet lex-leading
  orgs would never get reaped.
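
The persisted cursor and the skip-empty rule for `gcOrgTtsChunks` can be sketched together. This is an in-memory illustration under stated assumptions: `GcState` stands in for the singleton `ttsGcCursor` row, `runGcPass` and its callbacks are hypothetical, and the `MAX_ORGS_PER_RUN` value of 2 is chosen only to keep the example small.

```typescript
// Stand-in for the singleton ttsGcCursor row persisted between cron runs.
interface GcState {
  cursor: string | null; // last org visited, or null to start from the beginning
}

const MAX_ORGS_PER_RUN = 2; // illustrative value, not the real setting

function runGcPass(
  sortedOrgIds: string[],
  state: GcState,
  hasExpiredRows: (orgId: string) => boolean,
  reap: (orgId: string) => void,
): string[] {
  const start = state.cursor;
  // Resume strictly after the persisted cursor, in lexical order.
  const pending = start === null ? sortedOrgIds : sortedOrgIds.filter((id) => id > start);
  const processed: string[] = [];
  let lastVisited: string | null = null;
  let reachedEnd = true;
  for (const orgId of pending) {
    if (processed.length >= MAX_ORGS_PER_RUN) {
      reachedEnd = false; // budget spent with orgs still unvisited
      break;
    }
    lastVisited = orgId;
    if (!hasExpiredRows(orgId)) continue; // skip-empty: does not use up the budget
    reap(orgId);
    processed.push(orgId);
  }
  // Wrap to null at the end of the list so the next run starts over.
  state.cursor = reachedEnd ? null : lastVisited;
  return processed;
}
```

With orgs `a` through `e` where only `c`, `d`, and `e` have expired rows, the first pass skips `a` and `b`, reaps `c` and `d`, and stores `d` as the cursor; the second pass reaps `e` and wraps the cursor to null.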