Architecture SLA SLI SLO

Architecture - SLA, SLI, SLO

Service-level agreements, indicators, objectives. Error budgets. Burn rate alerts. Reporting cadence. Companion to Architecture-Observability, Architecture-Resilience.

Definitions

Term	Meaning
SLA (Service Level Agreement)	External commitment to a customer with consequences (refunds, credits) on breach.
SLO (Service Level Objective)	Internal target. We aim for it; reliability decisions reference it.
SLI (Service Level Indicator)	The metric that measures whether we're meeting an SLO.
Error budget	`(1 - SLO) × period` - the allowable amount of unreliability before action.
Burn rate	Speed at which the error budget is consumed.
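
The error budget formula above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `error_budget` and the 30-day default period are illustrative assumptions, not part of FinCore.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta = timedelta(days=30)) -> timedelta:
    """Allowable unreliability for an SLO: (1 - SLO) x period.

    `slo` is a fraction (0.999 for 99.9%). The name and the 30-day
    default are assumptions made for this illustration.
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    # Round to milliseconds to absorb floating-point noise in (1 - slo).
    seconds = round((1.0 - slo) * period.total_seconds(), 3)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    print(error_budget(0.999))  # 0:43:12
```

A 99.9% SLO over 30 days therefore leaves 43m 12s of allowable unreliability.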

FinCore Engine OSS does not publish SLAs - adopters define their own with their customers. We publish SLO targets as best-effort recommendations based on production fintech industry norms.

Sources for the targets below:

Google SRE Book - SLO chapter
PSD2 RTS - Strong Customer Authentication API SLA requirements
ISO 20022 - payment messaging norms
NIST SP 800-53 - security and privacy control catalog
SOC 2 Trust Services Criteria
Public status pages: Stripe, Wise, Adyen, Plaid, Revolut

SLI catalog (metrics-backed)

Availability SLIs

SLI	Definition	Metric expression
Ledger Post Availability	% of `POST /v1/transactions` returning non-5xx	`1 - (rate(http_server_requests_total{path="/v1/transactions",status=~"5.."}[5m]) / rate(http_server_requests_total{path="/v1/transactions"}[5m]))`
Read API Availability	% of GET requests returning non-5xx	same form as Ledger Post Availability, restricted to GET requests
Payment Initiate Availability	% of `POST /v1/payments` returning non-5xx	same form, with `path="/v1/payments"`
Decision Engine Availability	% of `POST /v1/decision/evaluate` returning non-5xx	same form, with `path="/v1/decision/evaluate"`
Webhook Subscription Delivery	% of webhooks delivered within 30s of event	`rate(webhook_delivered_within_30s[5m]) / rate(webhook_total_attempts[5m])`
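
To make the "non-5xx" definition concrete, here is a small sketch that computes an availability SLI from per-status request counts. The function name and the input shape (a dict of status code to count) are assumptions for illustration; in practice the PromQL expressions above do this work.

```python
def availability(status_counts: dict[int, int]) -> float:
    """Fraction of requests that did not return a 5xx status.

    `status_counts` maps HTTP status code -> request count for one window.
    4xx responses count as "good": they are client errors, not outages.
    """
    total = sum(status_counts.values())
    if total == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 1.0 - server_errors / total


if __name__ == "__main__":
    window = {200: 9_950, 404: 40, 503: 10}
    print(f"{availability(window):.4%}")
```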

Latency SLIs

SLI	Definition	Metric expression
Ledger Post p99	p99 of `POST /v1/transactions` server-side latency	`histogram_quantile(0.99, rate(http_server_requests_seconds_bucket{path="/v1/transactions"}[5m]))`
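Get Balance p99	p99 of `GET /v1/accounts/{id}/balance` (cached MV)	same form as Ledger Post p99, filtered to the balance route
Payment Initiate p99	p99 of `POST /v1/payments` (full sync flow)	same form as Ledger Post p99, with `path="/v1/payments"`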
Decision Eval p99	p99 of decision engine evaluation	`histogram_quantile(0.99, rate(decision_evaluation_duration_bucket[5m]))`
Outbox Publish Lag p99	p99 time from outbox row creation to Kafka publish	`histogram_quantile(0.99, rate(outbox_dispatcher_lag_seconds_bucket[5m]))`
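
The latency SLIs above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python rendition of that idea, not Prometheus's actual implementation; the bucket layout in the example is made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count), sorted by
    upper bound, whose last entry is the +Inf bucket. Simplified sketch:
    assumes non-negative observations and a lower bound of 0 for the
    first bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds.
    sample = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
    print(histogram_quantile(0.99, sample))
```

Because the result is interpolated, its accuracy depends on bucket boundaries; a bucket edge near each latency target (for example 0.3s or 0.5s) keeps the estimate meaningful.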

Correctness SLIs

SLI	Definition	Metric expression
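Ledger Invariant Compliance	% of transactions that satisfied SUM=0 invariant	`1 - (rate(ledger_invariant_violation_total[1h]))` (equals 1 only when there are no violations; not a true percentage, because the violation rate is not divided by the transaction rate; zero violations expected; any violation = breach)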
Idempotency Correctness	% of duplicate-key requests returning cached response	`rate(idempotency_replay_total[5m]) / rate(idempotency_check_total[5m])`
Outbox No-Loss	% of business writes that have a corresponding outbox row	offline reconciliation; alert on any miss
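
The SUM=0 invariant behind Ledger Invariant Compliance can be illustrated as follows. This is a sketch under stated assumptions: entries are (account, signed amount) pairs in a single currency, with debits positive and credits negative. The function names and the sign convention are hypothetical, not FinCore's schema.

```python
from decimal import Decimal

# One ledger entry: (account_id, signed amount). The sign convention
# (debit positive, credit negative) is an assumption for this example.
Entry = tuple[str, Decimal]


def satisfies_zero_sum(entries: list[Entry]) -> bool:
    """True when the signed amounts of one transaction sum to exactly zero."""
    return sum((amount for _, amount in entries), Decimal("0")) == 0


def invariant_compliance(transactions: list[list[Entry]]) -> float:
    """Fraction of transactions that satisfy the zero-sum invariant."""
    if not transactions:
        return 1.0
    good = sum(1 for entries in transactions if satisfies_zero_sum(entries))
    return good / len(transactions)


if __name__ == "__main__":
    balanced = [("cash", Decimal("100.00")), ("revenue", Decimal("-100.00"))]
    broken = [("cash", Decimal("100.00")), ("revenue", Decimal("-99.99"))]
    print(invariant_compliance([balanced, broken]))  # 0.5
```

`Decimal` is used rather than `float` so that amounts such as 0.10 compare exactly.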

Durability / Data SLIs

SLI	Definition
Postgres RPO	Max data loss window in catastrophic failure
Backup Recoverability	Quarterly DR drill demonstrates restore-from-backup
Decision Log Retention	Decision logs retained 7+ years per regulatory requirement

SLO targets

v0.1 (MVP)

Conservative; we don't promise more than we can measure.

SLI	SLO Target	Error budget per 30 days	Source
Ledger Post Availability	99.9%	43m 12s	initial baseline
Read API Availability	99.95%	21m 36s	initial baseline
Ledger Post p99 latency	< 500ms	n/a (per-request budget)	architectural target
Get Balance p99 (cached)	< 100ms	n/a	architectural target
Decision Eval p99	< 50ms	n/a	architectural target (relaxed for v0.1)
Webhook delivery within 30s	99.5%	3h 36m	initial baseline
Outbox publish lag p99	< 5s	n/a	architectural target
Ledger invariant compliance	100%	0 (no tolerance)	regulatory
Idempotency correctness	100%	0 (no tolerance)	correctness

v1.0 (production-stable, target Y1 H2)

More aggressive; adopters can rely on these.

SLI	SLO Target	Error budget per 30 days
Ledger Post Availability	99.95%	21m 36s
Read API Availability	99.99%	4m 19s
Ledger Post p99 latency	< 300ms	-
Get Balance p99	< 50ms	-
Decision Eval p99	< 10ms	-
Webhook delivery within 30s	99.9%	43m 12s
Outbox publish lag p99	< 1s	-
Payment durability (no in-flight loss)	99.9999%	~2.6s
Ledger invariant compliance	100%	0
Idempotency correctness	100%	0
KYC verification turnaround p95	< 10 min auto, < 24h manual	-

v2.0 (high-availability target, Y2)

SLI	SLO Target
Ledger Post Availability	99.99%
Read API Availability	99.999% (5 nines)

Error budget policy

Each SLO has a 30-day rolling error budget.

Budget statuses

Budget	Action
Healthy (>50% remaining)	Normal release cadence, all features ship
Cautious (10-50% remaining)	Code review demands explicit reliability discussion. New risky features delayed.
Frozen (<10% remaining)	Feature releases blocked. Only reliability fixes. Post-mortem required for budget consumption.
Exhausted (<0%)	Page on-call. All hands on reliability. Post-mortem mandatory. SLO temporarily downgraded to allow recovery.
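
The status bands in the table can be expressed as a small classifier. This is an illustrative sketch: the enum and function names are assumptions, and treating exactly 0% remaining as exhausted is a choice made here, since the table does not say which band that boundary belongs to.

```python
from enum import Enum


class BudgetStatus(Enum):
    HEALTHY = "healthy"
    CAUTIOUS = "cautious"
    FROZEN = "frozen"
    EXHAUSTED = "exhausted"


def classify_budget(remaining: float) -> BudgetStatus:
    """Map the remaining error budget (fraction of the 30-day budget) to a status.

    `remaining` is 1.0 when nothing is spent and goes negative once the
    budget is overspent.
    """
    if remaining <= 0.0:
        # Assumption: exactly 0% is treated as exhausted.
        return BudgetStatus.EXHAUSTED
    if remaining < 0.10:
        return BudgetStatus.FROZEN
    if remaining <= 0.50:
        return BudgetStatus.CAUTIOUS
    return BudgetStatus.HEALTHY


if __name__ == "__main__":
    for value in (0.8, 0.3, 0.05, -0.2):
        print(value, classify_budget(value).value)
```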

This policy is published, applies to maintainers, and is reviewed quarterly.

Recovery from exhaustion

1. Stop the bleeding (whatever caused the breach)
2. Post-mortem within 48 hours
3. Action items prioritized over feature work
4. SLO target temporarily relaxed (not the SLI; we still measure honestly)
5. Once budget is healthy for 7 consecutive days, restore SLO to original target
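
The final recovery step, restoring the original target after 7 consecutive healthy days, could be checked mechanically. The sketch below assumes one budget-remaining sample per day and reuses the "Healthy (>50% remaining)" band as the definition of healthy; the function name and its parameters are hypothetical.

```python
def can_restore_slo(
    daily_remaining: list[float],
    healthy_threshold: float = 0.50,
    required_days: int = 7,
) -> bool:
    """True when the most recent `required_days` samples are all healthy.

    `daily_remaining` holds the remaining error budget (as a fraction)
    per day, oldest first. "Healthy" means strictly above the threshold.
    """
    if len(daily_remaining) < required_days:
        return False
    recent = daily_remaining[-required_days:]
    return all(value > healthy_threshold for value in recent)


if __name__ == "__main__":
    history = [0.05, 0.20, 0.55, 0.60, 0.62, 0.70, 0.71, 0.75, 0.80]
    print(can_restore_slo(history))  # True: the last 7 samples are all above 0.50
```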

Burn rate alerts

Per Google SRE workbook - multi-window, multi-burn-rate alerts.

For SLO 99.9% (error budget = 0.1%):

Burn rate	Detection window	Action
14.4×	1 hour	P0 - burns 30d budget in ~2d (50h); 2% of budget spent per hour
6×	6 hours	P1 - burns 30d budget in 5d
1×	3 days	P2 - burns 30d budget in 30d (track only)
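
The relationship between burn rate, time to exhaustion, and the share of budget spent inside a detection window is simple arithmetic. A sketch, with function names chosen for illustration and a 30-day budget period assumed:

```python
def hours_to_exhaustion(burn_rate: float, period_days: int = 30) -> float:
    """Hours until the whole budget is spent at a constant burn rate.

    A burn rate of 1 spends the budget in exactly one period.
    """
    return period_days * 24 / burn_rate


def budget_spent_in_window(
    burn_rate: float, window_hours: float, period_days: int = 30
) -> float:
    """Fraction of the period's budget spent during one detection window."""
    return burn_rate * window_hours / (period_days * 24)


if __name__ == "__main__":
    for rate, window in ((14.4, 1), (6, 6), (1, 72)):
        spent = budget_spent_in_window(rate, window)
        hours = hours_to_exhaustion(rate)
        print(f"{rate}x over {window}h: {spent:.0%} spent, exhausted in {hours:.0f}h")
```

At 14.4× the budget lasts about 50 hours and 2% of it is spent in the 1-hour window; at 6× it lasts 5 days with 5% spent in 6 hours; at 1× it lasts the full 30 days with 10% spent in 3 days.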

Implementation:

- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[1h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[1h]))
    ) > (14.4 * 0.001)   # 14.4× burn against 99.9% SLO
  for: 5m
  labels: { severity: P0, slo: ledger_post_availability }

- alert: AvailabilityBurnRateMedium
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[6h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[6h]))
    ) > (6 * 0.001)
  for: 30m
  labels: { severity: P1, slo: ledger_post_availability }
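
The threshold in each rule is the burn-rate multiplier times the error budget fraction (`1 - SLO`). The sketch below reproduces that comparison in Python for offline checks such as unit-testing alert thresholds against recorded counts; the function names are assumptions and this is not how Prometheus evaluates the rules.

```python
def burn_rate(errors: float, total: float, slo: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO)."""
    if total == 0:
        # Assumption: no traffic in the window means no budget is burned.
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_fire(errors: float, total: float, slo: float, threshold: float) -> bool:
    """Mirror of `error_ratio > threshold * (1 - SLO)` from the rules above."""
    return burn_rate(errors, total, slo) > threshold


if __name__ == "__main__":
    # 200 errors in 10,000 requests against a 99.9% SLO is roughly a 20x burn.
    print(should_fire(200, 10_000, slo=0.999, threshold=14.4))  # True
    print(should_fire(100, 10_000, slo=0.999, threshold=14.4))  # False (about 10x)
```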

SLO dashboard

Single Grafana dashboard with one panel per SLO:

Current SLI value
30-day SLO target line
Error budget remaining (visualized as fuel gauge)
Burn rate over time
Recent SLO breaches

JSON in deploy/grafana/dashboards/slo.json.

Reporting cadence

Audience	Cadence	Format
Maintainers (internal)	Weekly	Slack post with SLI numbers + budget status
Maintainer team / contributors	Monthly	Release notes / forum post (transparency)
Public adopters	Quarterly	Status page update
Regulatory (eventually)	Annual	SOC 2 audit deliverable

What FinCore does NOT promise (yet)

No public SLA in OSS - adopters bring their own
No 24/7 on-call from upstream maintainers (community contracts only)
No multi-region active-active in v0.1 (RTO 30 min, RPO 1 sec via PITR)
No real-time fraud signal SLA - that's the adopter's responsibility through the RiskScorer plug-in
No data residency guarantee in v0.1 - single-region deployment

These show up in the roadmap as adopters demand them.

Adopter SLA template

For adopters defining SLAs with their own customers, this template encodes our SLOs as adopter-side baselines:

## Service Level Agreement (SLA)

We commit to the following service levels, measured monthly:

### Availability
- Ledger and balance APIs: 99.95% uptime
- Payment API: 99.95% uptime
- Decision and compliance APIs: 99.9% uptime

### Latency (p99)
- Get balance: < 200ms
- Post transaction: < 500ms
- Initiate payment: < 1s
- Decision evaluation: < 100ms

### Webhooks
- Delivered within 30 seconds: 99.9%
- Maximum delivery attempts: 7 over 7 days

### Compensation
- Service availability < 99.9% in a billing month: 10% credit
- Service availability < 99% in a billing month: 25% credit
- Service availability < 95% in a billing month: 50% credit

### Exclusions
- Scheduled maintenance windows (announced 7 days in advance)
- Force majeure
- Customer-controlled infrastructure (their bank, their KYC provider, their LLM)

Adopters customize this template and present it to their customers.
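
The compensation tiers in the template map measured monthly availability to a credit percentage. A minimal sketch of that lookup, assuming availability is expressed as a fraction and that the lowest matching tier wins; the function name is hypothetical, and adopters who change the tiers would adjust the thresholds accordingly.

```python
def service_credit_percent(availability: float) -> int:
    """Credit owed for a billing month, per the template's compensation tiers."""
    if availability < 0.95:
        return 50
    if availability < 0.99:
        return 25
    if availability < 0.999:
        return 10
    return 0


if __name__ == "__main__":
    print(service_credit_percent(0.9985))  # 10
    print(service_credit_percent(0.98))    # 25
```

With the template as written, a month at 99.92% availability misses the 99.95% commitment but earns no credit, because the first credit tier starts below 99.9%.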

Private vertical layer

Private-vertical SLOs are documented privately. Not part of OSS commitments.
