Problem
Aperio's SIEM dispatcher (workers/siem-dispatcher.ts + internal/siemdispatcher/) is built for SOC consumption: finding-shaped envelopes pushed in near-real-time to Splunk, Panther, etc. That is correct for alerting, but wrong for two other audiences:
Security analysts doing longitudinal queries ("what's our 90-day median time-to-resolve?", "which OAuth scopes most often precede incidents?", "show me cross-tenant trends in our compliance scorecard"). They want SQL-shaped data in a warehouse, not finding-by-finding pushes.
Compliance needing append-only, multi-year retention without keeping hot Postgres rows around forever.
Aperio's Postgres is the hot transactional store; running analytical workloads against it is wrong, and growing it indefinitely is wrong.
Goals
Data warehouse export adapters for Snowflake, BigQuery, Databricks, and Apache Iceberg on S3.
Watermark-based CDC
For each (destination, table) pair the dispatcher tracks a high-water-mark (updatedAt, id) cursor. The dispatcher batches rows in fixed time windows (default 15 min), encodes each batch with the per-table schema, and ships it to the destination.
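As a minimal sketch of the cursor and windowing logic described above: the `Row` and `Cursor` shapes, the function names, and the in-memory row list are all hypothetical; only the (updatedAt, id) ordering and the 15-minute default come from the text. A real dispatcher would presumably read rows with a keyset query and advance the stored cursor only after the destination confirms delivery.

```typescript
// Hypothetical shapes; only the (updatedAt, id) ordering comes from the design.
interface Row { id: string; updatedAt: Date; }
interface Cursor { updatedAt: Date; id: string; }

const WINDOW_MS = 15 * 60 * 1000; // default 15-minute window

// Total order on (updatedAt, id): negative if a sorts before b.
function compareRows(a: Cursor, b: Cursor): number {
  const dt = a.updatedAt.getTime() - b.updatedAt.getTime();
  if (dt !== 0) return dt;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

// Start (epoch ms) of the fixed window that contains the timestamp.
function windowStart(ts: Date, windowMs: number = WINDOW_MS): number {
  return Math.floor(ts.getTime() / windowMs) * windowMs;
}

// Picks rows strictly after the cursor, groups them by window,
// and returns the cursor to persist once delivery succeeds.
function nextBatches(
  rows: Row[],
  cursor: Cursor,
  windowMs: number = WINDOW_MS,
): { batches: Map<number, Row[]>; next: Cursor } {
  const pending = rows
    .filter((r) => compareRows(r, cursor) > 0)
    .sort(compareRows);
  const batches = new Map<number, Row[]>();
  for (const row of pending) {
    const key = windowStart(row.updatedAt, windowMs);
    const bucket = batches.get(key) ?? [];
    bucket.push(row);
    batches.set(key, bucket);
  }
  const last = pending[pending.length - 1];
  const next = last ? { updatedAt: last.updatedAt, id: last.id } : cursor;
  return { batches, next };
}
```

Using id as a tie-breaker keeps the cursor strictly monotonic even when many rows share the same updatedAt value.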
Adapters:
Snowflake: PUT Parquet files to an internal stage, then COPY INTO.
BigQuery: bigquery.Inserter streaming inserts for hot rows; load jobs for backfill.
Databricks: dbsql driver; tables registered in Unity Catalog.
Iceberg on S3: write Parquet and commit to the Iceberg catalog (Glue / Nessie).
Parquet on S3: partitioned by org_id / date.
JSONL on S3: append-only files (cheapest, dumbest).
Postgres replica: for customers that want a logical read replica without a managed warehouse.
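For the two plain S3 adapters, the sketch below shows one way the object layout and the JSONL encoding might look. The Hive-style `org_id=…/date=…` key layout, the `part-0000` file name, and the `ExportRow` shape are assumptions made for illustration; the text only says the files are partitioned by org_id / date and that JSONL files are append-only.

```typescript
// Hypothetical row shape for export; real tables follow the published schemas.
interface ExportRow {
  org_id: string;
  id: string;
  [column: string]: unknown;
}

// Builds an object key partitioned by org_id and UTC date.
// The Hive-style "key=value" segments are an assumption, not the spec.
function partitionKey(
  table: string,
  orgId: string,
  day: Date,
  ext: "parquet" | "jsonl",
): string {
  const date = day.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `${table}/org_id=${orgId}/date=${date}/part-0000.${ext}`;
}

// Encodes rows as JSON Lines: one JSON object per line, newline-terminated.
function toJsonl(rows: ExportRow[]): string {
  return rows.map((row) => JSON.stringify(row) + "\n").join("");
}
```

Terminating every line with a newline means a later batch can be appended to the same file without re-reading it.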
Schema stability
Per-table schemas are published in warehouse/schemas/<table>/v1.json. Schema changes are additive between minor versions; breaking changes bump the major version and run a parallel dual-write window (v1 and v2 tables) before v1 is deprecated.
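The additive-versus-breaking rule could be enforced with a small check along these lines. The `TableSchema` shape (a flat map of field names to types) is a hypothetical stand-in for whatever v1.json actually contains, and treating any removed or retyped field as breaking is an assumption about what "breaking" means here.

```typescript
// Hypothetical, simplified schema shape; the real v1.json format is not specified.
type FieldType = "string" | "number" | "boolean" | "timestamp";
interface TableSchema { fields: Record<string, FieldType>; }

type SchemaChange = "none" | "additive" | "breaking";

// Classifies the change from prev to next:
// - "breaking" if any existing field is removed or changes type (major bump, dual-write)
// - "additive" if fields are only added (minor bump)
// - "none" if the field sets are identical
function classifyChange(prev: TableSchema, next: TableSchema): SchemaChange {
  for (const [name, type] of Object.entries(prev.fields)) {
    if (next.fields[name] !== type) return "breaking";
  }
  const added = Object.keys(next.fields).some((name) => !(name in prev.fields));
  return added ? "additive" : "none";
}
```

A check like this could run in CI against the previously published schema file so that a breaking change cannot ship under a minor version by accident.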
Cold-storage tier
When a row in the hot DB ages past Organization.dataRetentionDays:
Verify it's been delivered to at least one WarehouseDestination (or marked "no destination configured" via explicit operator override).
Insert a tombstone with id + delivered_to into a cold_storage_archive audit table.
Drop the hot row.
This bounds Postgres growth without losing any data the customer can prove they have in their warehouse.
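The three steps above can be sketched as a pure decision function. The `HotRow` shape, measuring age from `createdAt`, and the `noDestinationOverride` flag name are assumptions for illustration; the text does not say which timestamp the retention window is measured from.

```typescript
// Hypothetical shapes; only id, delivered_to and dataRetentionDays come from the design.
interface HotRow { id: string; createdAt: Date; deliveredTo: string[]; }
interface Tombstone { id: string; delivered_to: string[]; }

type ArchiveDecision =
  | { action: "keep"; reason: string }
  | { action: "archive"; tombstone: Tombstone };

const DAY_MS = 24 * 60 * 60 * 1000;

// Decides whether a hot row may be dropped. The caller is expected to insert
// the tombstone into cold_storage_archive and delete the hot row together.
function decideArchive(
  row: HotRow,
  dataRetentionDays: number,
  now: Date,
  noDestinationOverride: boolean = false,
): ArchiveDecision {
  const ageDays = (now.getTime() - row.createdAt.getTime()) / DAY_MS;
  if (ageDays <= dataRetentionDays) {
    return { action: "keep", reason: "within retention window" };
  }
  // Step 1: require at least one confirmed delivery unless an operator opted out.
  if (row.deliveredTo.length === 0 && !noDestinationOverride) {
    return { action: "keep", reason: "not yet delivered to any destination" };
  }
  // Steps 2 and 3: emit the tombstone; the hot row can then be dropped.
  return { action: "archive", tombstone: { id: row.id, delivered_to: row.deliveredTo } };
}
```

Writing the tombstone and deleting the hot row in a single transaction would avoid a state where the row is gone but no audit record exists.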
Phasing
Cold-storage tier wired to retention policy; tombstoning and recall path; analyst-facing example notebooks in warehouse/notebooks/
Open questions
How to handle late-arriving updates after watermark advances — re-emit + idempotent merges in the warehouse, or freeze the analytical surface at delivery time?
Per-table or per-destination delivery cadence — make it operator-tunable.
Cold-storage tombstone — is "we deleted this and you have it in your warehouse" enough for SOC 2, or do we need to keep the tombstone for the audit window?
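To make the first open question more concrete, here is a rough sketch of the "re-emit + idempotent merge" option, with an in-memory Map standing in for a warehouse table. The `VersionedRow` shape and the last-writer-wins rule on updatedAt are assumptions; in a real warehouse this would more likely be a MERGE statement keyed on id.

```typescript
// Hypothetical row shape; updatedAt is epoch milliseconds for easy comparison.
interface VersionedRow { id: string; updatedAt: number; payload: unknown; }

// Applies a batch so that re-delivery is harmless: a row replaces the stored
// version only when it is strictly newer. Replaying the same batch is a no-op,
// and an out-of-order older version never overwrites a newer one.
function mergeIdempotent(
  table: Map<string, VersionedRow>,
  batch: VersionedRow[],
): void {
  for (const row of batch) {
    const existing = table.get(row.id);
    if (!existing || row.updatedAt > existing.updatedAt) {
      table.set(row.id, row);
    }
  }
}
```

The trade-off against freezing the analytical surface is that warehouse rows can change after delivery, so downstream queries see the latest state rather than a point-in-time snapshot.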
References
Reuses: SiemDelivery durable-outbox pattern wholesale; tokenKeyVersion for credential rotation; existing JSONL SIEM destination as the v1 file shape.
Goals
IngestedEvent, SecurityFinding, RuleRun (Product observability: connector health, SIEM delivery, rule-run audit, /metrics #54), WorkflowDelivery (Workflow & ticketing integration: JIRA, Linear, Slack, Teams, PagerDuty, SLA tracking #50), and audit-log rows are archived to a customer-controlled object store, with a configurable retention window in the hot Postgres.
Non-goals
Proposed design
What to reuse
The SIEM dispatcher pattern (internal/siemdispatcher/) is the right template:
SiemDelivery table semantics (the durable-outbox pattern): adapt to a new WarehouseDelivery.
Deliver(ctx, payload) with retry/backoff + lease semantics.
New schema
Phasing
FINDINGS, EVENTS, AUDIT_LOG; /admin/warehouse CRUD UI