Problem
Rule-based detection (#47) catches what we already know to look for. It doesn't catch the long tail of "this user logged in from Lagos at 3 AM with a brand-new device and immediately created 17 OAuth grants" — because no rule author thought to write that exact predicate.
UEBA (User and Entity Behavior Analytics) builds per-identity / per-entity baselines and alerts on deviations from the baseline, surfacing novel and slow-rolling attacks that escape rule packs.
Goals
- Per-identity activity baselines spanning every connected SaaS (login hour-of-day distribution, IP geo entropy, OAuth grant rate, repo push rate, MFA usage patterns).
- Deviation detection with configurable sensitivity per dimension; surfaces findings tagged ueba.*.
- Peer-group baselining — compare an identity against its team/role peers (engineering vs. finance vs. HR).
- Asset-level baselines — repo commit cadence, channel post velocity, drive-share rate.
- Explainability — every UEBA finding must say which baseline dimension deviated, by how much, and over what window.
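One of the dimensions named above, IP geo entropy, can be made concrete as Shannon entropy over the countries an identity logs in from. The sketch below is only an illustration; the function name `GeoEntropy` and the count-map input are assumptions, not part of the proposal.

```go
package main

import "math"

// GeoEntropy returns the Shannon entropy, in bits, of a histogram of
// logins per country. The name and input shape are hypothetical.
func GeoEntropy(loginsByCountry map[string]int) float64 {
	total := 0
	for _, c := range loginsByCountry {
		total += c
	}
	if total == 0 {
		return 0
	}
	h := 0.0
	for _, c := range loginsByCountry {
		if c == 0 {
			continue // zero-count categories contribute nothing
		}
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}
```

An identity that always logs in from one country scores 0 bits, and one split evenly across four countries scores 2 bits, so a sudden jump in this value is itself a candidate deviation.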
Non-goals
Proposed design
New domain model
enum BaselineKind {
  IDENTITY_ACTIVITY
  IDENTITY_AUTH_GEO
  IDENTITY_OAUTH_GRANT_RATE
  IDENTITY_PEER_GROUP
  ASSET_ACTIVITY
}

model BehaviorBaseline {
  id String @id @default(cuid())
  organizationId String @map("organization_id")
  personId String? @map("person_id") // null = asset-keyed
  assetId String? @map("asset_id")
  kind BaselineKind
  dimension String @db.VarChar(120) // e.g. "login_hour_of_day", "source_country", "oauth_grants_per_day"
  windowDays Int @map("window_days") // 30, 90
  modelType String @map("model_type") @db.VarChar(40) // "zscore" | "iqr" | "mad" | "rate"
  modelParams Json @map("model_params")
  trainingStats Json @map("training_stats") // mean, stddev, p50, p90, p99, etc.
  observations Int @default(0)
  computedAt DateTime @default(now()) @map("computed_at")
  expiresAt DateTime @map("expires_at")
  organization Organization @relation(...)

  @@unique([organizationId, personId, assetId, kind, dimension, windowDays])
  @@index([organizationId, kind, computedAt])
  @@map("behavior_baselines")
}

model BehaviorDeviation {
  id String @id @default(cuid())
  organizationId String @map("organization_id")
  baselineId String @map("baseline_id")
  observedValue Float @map("observed_value")
  expectedValue Float @map("expected_value")
  zScore Float? @map("z_score")
  severity Severity
  contextEventIds String[] @default([]) @map("context_event_ids")
  findingId String? @map("finding_id")
  observedAt DateTime @map("observed_at")
  organization Organization @relation(...)

  @@index([organizationId, baselineId, observedAt])
  @@map("behavior_deviations")
}
Baseline builder worker
internal/ueba/baseline_builder.go:
- Runs nightly per org, per BaselineKind × dimension × person/asset.
- Reads N-day rolling window of IngestedEvent rows for the entity.
- Computes statistical model (z-score for continuous, multinomial for categorical like country/device, rate for count-based).
- Writes BehaviorBaseline rows; old baselines age out via expiresAt.
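The "computes statistical model" step could look roughly like the following. This is a minimal sketch, not the proposed implementation: the `TrainingStats` struct, its JSON keys, and both function names are assumptions, and the population standard deviation and nearest-rank percentile are just one reasonable choice each.

```go
package main

import (
	"math"
	"sort"
)

// TrainingStats is a hypothetical shape for BehaviorBaseline.trainingStats.
type TrainingStats struct {
	Mean         float64 `json:"mean"`
	StdDev       float64 `json:"stddev"`
	P50          float64 `json:"p50"`
	P99          float64 `json:"p99"`
	Observations int     `json:"observations"`
}

// BuildContinuous summarizes a window of numeric observations
// (for example, OAuth grants per day) for a z-score model.
func BuildContinuous(values []float64) TrainingStats {
	n := len(values)
	if n == 0 {
		return TrainingStats{}
	}
	sorted := append([]float64(nil), values...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)

	var sum float64
	for _, v := range sorted {
		sum += v
	}
	mean := sum / float64(n)

	var sq float64
	for _, v := range sorted {
		sq += (v - mean) * (v - mean)
	}
	// Population standard deviation; the n-1 sample estimator is an equally valid choice.
	std := math.Sqrt(sq / float64(n))

	return TrainingStats{
		Mean:         mean,
		StdDev:       std,
		P50:          percentile(sorted, 0.50),
		P99:          percentile(sorted, 0.99),
		Observations: n,
	}
}

// percentile applies the nearest-rank method to an already sorted, non-empty slice.
func percentile(sorted []float64, p float64) float64 {
	rank := int(math.Ceil(p * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// BuildCategorical turns observed categories (for example, source country
// per login) into an empirical multinomial distribution.
func BuildCategorical(categories []string) map[string]float64 {
	dist := make(map[string]float64)
	for _, c := range categories {
		dist[c]++
	}
	for k := range dist {
		dist[k] /= float64(len(categories))
	}
	return dist
}
```

The continuous result would be serialized into trainingStats, and the categorical map is the kind of distribution that later appears as expected_distribution in finding evidence.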
Live scorer
internal/ueba/scorer.go:
- Subscribes to ingestion event bus.
- For each event, looks up applicable baselines for the actor and asset.
- Scores deviation; if above threshold, writes BehaviorDeviation and opens a SecurityFinding (ruleKey = "ueba.<dimension>").
- Caches active baselines in memory; refreshes from DB hourly.
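The scoring step for a continuous dimension might be sketched as below. The type names, the `MEDIUM` severity value, and the rule that `HIGH` starts at twice the threshold are all illustrative assumptions; the proposal itself only shows `HIGH` and does not fix a severity mapping.

```go
package main

import "math"

// Severity values are assumptions; the proposal only shows "HIGH".
type Severity string

const (
	SeverityMedium Severity = "MEDIUM"
	SeverityHigh   Severity = "HIGH"
)

// Baseline carries just the statistics this sketch needs.
type Baseline struct {
	Mean   float64
	StdDev float64
}

// Deviation mirrors a few BehaviorDeviation columns.
type Deviation struct {
	ObservedValue float64
	ExpectedValue float64
	ZScore        float64
	Severity      Severity
}

// Score returns a Deviation and true when |z| reaches the threshold.
// Doubling the threshold for HIGH is an arbitrary illustrative choice.
func Score(b Baseline, observed, threshold float64) (Deviation, bool) {
	if b.StdDev == 0 {
		// A constant history makes z undefined; this sketch simply skips it.
		return Deviation{}, false
	}
	z := (observed - b.Mean) / b.StdDev
	if math.Abs(z) < threshold {
		return Deviation{}, false
	}
	sev := SeverityMedium
	if math.Abs(z) >= 2*threshold {
		sev = SeverityHigh
	}
	return Deviation{
		ObservedValue: observed,
		ExpectedValue: b.Mean,
		ZScore:        z,
		Severity:      sev,
	}, true
}
```

With a baseline of mean 5 and standard deviation 2, an observation of 17 gives z = 6, which at a threshold of 3 would be written as a HIGH deviation, while an observation of 6 produces nothing.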
Peer-group baselining
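Defines peer groups by:
- Person.department (when set by HRIS in Identity-first correlation, lifecycle & non-human inventory (ISPM) #45)
- Person.jobTitle
- Okta.role, GitHub team, Slack user group
Peer-group baselines are org-wide rather than per-identity; deviation = "this user is doing X more than 2σ above peers."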
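The "more than 2σ above peers" comparison could be sketched as follows. The `Member` struct and `AbovePeers` function are hypothetical, and grouping by a single department string is a simplification of the peer-group keys listed above.

```go
package main

import "math"

// Member pairs an identity with its peer-group key and a metric value.
type Member struct {
	PersonID   string
	Department string  // stands in for Person.department
	Value      float64 // e.g. OAuth grants per day
}

// AbovePeers returns the IDs of members whose value exceeds their
// department's mean by more than k standard deviations.
// Each member is included in its own group's statistics, which dampens
// outliers in small groups; a leave-one-out variant would avoid that.
func AbovePeers(members []Member, k float64) []string {
	type agg struct{ n, sum, sumSq float64 }
	groups := make(map[string]*agg)
	for _, m := range members {
		g := groups[m.Department]
		if g == nil {
			g = &agg{}
			groups[m.Department] = g
		}
		g.n++
		g.sum += m.Value
		g.sumSq += m.Value * m.Value
	}

	var flagged []string
	for _, m := range members {
		g := groups[m.Department]
		if g.n < 2 {
			continue // no peers to compare against
		}
		mean := g.sum / g.n
		variance := g.sumSq/g.n - mean*mean
		if variance <= 0 {
			continue // identical values: nothing stands out
		}
		if m.Value > mean+k*math.Sqrt(variance) {
			flagged = append(flagged, m.PersonID)
		}
	}
	return flagged
}
```

Calling `AbovePeers(members, 2)` corresponds to the 2σ wording above; the multiplier would presumably become one of the operator-tunable sensitivities.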
Explainability + UX
Every UEBA finding's evidence JSON contains:
{
  "baseline_dimension": "source_country",
  "training_window_days": 30,
  "expected_distribution": {"US": 0.94, "DE": 0.05, "other": 0.01},
  "observed_value": "NG",
  "deviation_severity": "HIGH",
  "context_event_ids": ["..."]
}
Finding card shows a tiny sparkline of the baseline + the deviation point — critical for operator trust.
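On the Go side, the evidence payload above could map onto a struct like the one below. The JSON keys come from the example; the Go type, field names, and helper functions are assumptions, and `ObservedValue` is typed as a string only because the example is categorical.

```go
package main

import "encoding/json"

// Evidence mirrors the JSON keys in the example payload.
// The Go names are hypothetical; a numeric dimension would need a
// different type for ObservedValue.
type Evidence struct {
	BaselineDimension    string             `json:"baseline_dimension"`
	TrainingWindowDays   int                `json:"training_window_days"`
	ExpectedDistribution map[string]float64 `json:"expected_distribution"`
	ObservedValue        string             `json:"observed_value"`
	DeviationSeverity    string             `json:"deviation_severity"`
	ContextEventIDs      []string           `json:"context_event_ids"`
}

// EncodeEvidence serializes the evidence for storage on a finding.
func EncodeEvidence(e Evidence) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// DecodeEvidence parses stored evidence, for example when rendering a finding card.
func DecodeEvidence(s string) (Evidence, error) {
	var e Evidence
	err := json.Unmarshal([]byte(s), &e)
	return e, err
}
```

Keeping the payload in a typed struct makes it harder for a scorer to emit a finding that omits the dimension, window, or expected values the explainability goal requires.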
Phasing
| Phase | Scope |
|---|---|
| P1 | Identity activity baselines (login hour, country, device); single-dimension z-score scorer; UEBA findings flowing through normal pipeline |
| P2 | OAuth-grant-rate baseline; asset-activity baselines (repo push velocity, channel post rate) |
| P3 | Peer-group baselining (requires #45 Person + HRIS sync); multinomial categorical models |
| P4 | Optional pluggable ML scorer (BYO model); operator-tunable sensitivity per baseline; baseline rebuild on demand |
Open questions
- Storage budget for baselines: 1k baselines × 1k identities × 1k orgs is on the order of 10^9 rows; cap or downsample.
- Cold-start problem: how long until a baseline is "trusted enough" to alert on?
- False-positive throttling: should we suppress findings from a baseline until it has 30 days of clean activity?
- Pluggable model backend: start statistical, expose a Go interface that allows BYO model in P4.
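As a starting point for the last two questions, the Go interface and a cold-start gate might look like this. Everything here is a sketch under assumptions: the `Model` interface, its method set, and the `MinObservations` field are hypothetical, and the right minimum is exactly what the cold-start question leaves open.

```go
package main

import "math"

// Model is a hypothetical pluggable scoring backend.
type Model interface {
	// Score returns a non-negative anomaly score for an observed value.
	Score(observed float64) float64
	// Trusted reports whether enough observations exist to alert on.
	Trusted() bool
}

// ZScoreModel is the statistical default in this sketch.
type ZScoreModel struct {
	Mean            float64
	StdDev          float64
	Observations    int
	MinObservations int // cold-start gate; the right value is an open question
}

// Score returns the absolute z-score, or 0 when the history is constant.
func (m ZScoreModel) Score(observed float64) float64 {
	if m.StdDev == 0 {
		return 0
	}
	return math.Abs(observed-m.Mean) / m.StdDev
}

// Trusted is true once the baseline has seen MinObservations data points.
func (m ZScoreModel) Trusted() bool {
	return m.Observations >= m.MinObservations
}

// ShouldAlert applies the same gate to any backend, statistical or BYO.
func ShouldAlert(m Model, observed, threshold float64) bool {
	return m.Trusted() && m.Score(observed) >= threshold
}
```

Putting the trust check behind the interface means a BYO model in P4 could define "trusted" in its own terms while the scorer keeps a single alerting path.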
References
IngestedEvent, SecurityFinding, the rule-engine evidence format from Detection-as-code: declarative YAML rules + community rule packs #47.