
Architecture SLA SLI SLO

Tiana_ edited this page May 30, 2026 · 1 revision

Architecture - SLA, SLI, SLO

Service-level agreements, indicators, objectives. Error budgets. Burn rate alerts. Reporting cadence. Companion to Architecture-Observability, Architecture-Resilience.


Definitions

| Term | Meaning |
|---|---|
| SLA (Service Level Agreement) | External commitment to a customer with consequences (refunds, credits) on breach. |
| SLO (Service Level Objective) | Internal target. We aim for it; reliability decisions reference it. |
| SLI (Service Level Indicator) | The metric that measures whether we're meeting an SLO. |
| Error budget | (1 - SLO) × period: the allowable amount of unreliability before action. |
| Burn rate | Speed at which the error budget is consumed, relative to spending it evenly over the period (1× = exactly on budget). |
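
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the FinCore codebase; the function names `error_budget` and `burn_rate` are hypothetical.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta) -> timedelta:
    """Allowable unreliability over `period` for a target such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return timedelta(seconds=(1.0 - slo) * period.total_seconds())


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0.0:
        raise ValueError("a 100% SLO has no budget to burn")
    return observed_error_ratio / budget_ratio


# A 99.9% SLO over 30 days leaves roughly 43 minutes 12 seconds of budget.
budget_999 = error_budget(0.999, timedelta(days=30))
```

An observed error ratio of 1.44% against a 99.9% SLO corresponds to a burn rate of about 14.4×.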

FinCore Engine OSS does not publish SLAs - adopters define their own with their customers. We publish SLO targets as best-effort recommendations based on production fintech industry norms.

Sources for the targets below:


SLI catalog (metrics-backed)

Availability SLIs

| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Post Availability | % of POST /v1/transactions returning non-5xx | `1 - (sum(rate(http_server_requests_total{path="/v1/transactions",status=~"5.."}[5m])) / sum(rate(http_server_requests_total{path="/v1/transactions"}[5m])))` |
| Read API Availability | % of GET requests returning non-5xx | Same form, restricted to GET requests |
| Payment Initiate Availability | % of POST /v1/payments returning non-5xx | Same form, for /v1/payments |
| Decision Engine Availability | % of POST /v1/decision/evaluate returning non-5xx | Same form, for /v1/decision/evaluate |
| Webhook Subscription Delivery | % of webhooks delivered within 30s of event | `sum(rate(webhook_delivered_within_30s[5m])) / sum(rate(webhook_total_attempts[5m]))` |
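
The availability SLIs all share one shape: the fraction of requests in a window that did not return 5xx. A minimal sketch of that ratio, assuming request counts have already been collected for the window (the function name is hypothetical):

```python
from typing import Optional


def availability(total_requests: int, server_errors: int) -> Optional[float]:
    """Fraction of requests that did not return 5xx, or None with no traffic."""
    if total_requests < 0 or not 0 <= server_errors <= total_requests:
        raise ValueError("counts must satisfy 0 <= server_errors <= total_requests")
    if total_requests == 0:
        # No traffic: the SLI is undefined, not 100% and not 0%.
        return None
    return 1.0 - server_errors / total_requests
```

Treating a window with no traffic as undefined avoids both a false breach and a falsely perfect score.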

Latency SLIs

| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Post p99 | p99 of POST /v1/transactions server-side latency | `histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{path="/v1/transactions"}[5m])))` |
| Get Balance p99 | p99 of GET /v1/accounts/{id}/balance (cached MV) | Same form, for this path |
| Payment Initiate p99 | p99 of POST /v1/payments (full sync flow) | Same form, for this path |
| Decision Eval p99 | p99 of decision engine evaluation | `histogram_quantile(0.99, sum by (le) (rate(decision_evaluation_duration_bucket[5m])))` |
| Outbox Publish Lag p99 | p99 time from outbox row creation to Kafka publish | `histogram_quantile(0.99, sum by (le) (rate(outbox_dispatcher_lag_seconds_bucket[5m])))` |
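
The latency SLIs are estimates, not exact percentiles: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified Python approximation of that idea, assuming cumulative `(upper_bound, count)` buckets ending in `+Inf`; it is not the Prometheus implementation and skips its edge cases.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; report the last finite bound.
                return prev_bound
            # Assume observations are spread uniformly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts for illustration (seconds, cumulative count).
example = [(0.1, 900), (0.3, 980), (0.5, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example)
```

Because the result depends on bucket boundaries, a target such as "< 500ms" is best measured with a bucket edge at or near 0.5s.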

Correctness SLIs

| SLI | Definition | Metric expression |
|---|---|---|
| Ledger Invariant Compliance | % of transactions that satisfied SUM=0 invariant | `increase(ledger_invariant_violation_total[1h])` must equal 0 (zero violations expected; any violation = breach) |
| Idempotency Correctness | % of duplicate-key requests returning cached response | `rate(idempotency_replay_total[5m]) / rate(idempotency_check_total[5m])` |
| Outbox No-Loss | % of business writes that have a corresponding outbox row | offline reconciliation; alert on any miss |
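
The Outbox No-Loss check is a set difference: every business write should have a matching outbox row. A minimal sketch of the offline reconciliation, assuming both sides can be exported as lists of a shared identifier (the identifier format and function name are hypothetical):

```python
from typing import Iterable, Set


def missing_outbox_rows(
    business_write_ids: Iterable[str],
    outbox_row_ids: Iterable[str],
) -> Set[str]:
    """Return IDs of business writes that have no corresponding outbox row."""
    return set(business_write_ids) - set(outbox_row_ids)


# Illustrative data only: "txn-3" was written but never reached the outbox.
misses = missing_outbox_rows(
    ["txn-1", "txn-2", "txn-3"],
    ["txn-1", "txn-2"],
)
```

Any non-empty result is a miss and should alert, since the target for this SLI has no tolerance.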

Durability / Data SLIs

| SLI | Definition |
|---|---|
| Postgres RPO | Max data loss window in catastrophic failure |
| Backup Recoverability | Quarterly DR drill demonstrates restore-from-backup |
| Decision Log Retention | Decision logs retained 7+ years per regulatory requirement |

SLO targets

v0.1 (MVP)

Conservative; we don't promise more than we can measure.

| SLI | SLO Target | Error budget per 30 days | Source |
|---|---|---|---|
| Ledger Post Availability | 99.9% | 43m 12s | initial baseline |
| Read API Availability | 99.95% | 21m 36s | initial baseline |
| Ledger Post p99 latency | < 500ms | n/a (per-request budget) | architectural target |
| Get Balance p99 (cached) | < 100ms | n/a | architectural target |
| Decision Eval p99 | < 50ms | n/a | architectural target (relaxed for v0.1) |
| Webhook delivery within 30s | 99.5% | 3h 36m | initial baseline |
| Outbox publish lag p99 | < 5s | n/a | architectural target |
| Ledger invariant compliance | 100% | 0 (no tolerance) | regulatory |
| Idempotency correctness | 100% | 0 (no tolerance) | correctness |

v1.0 (production-stable, target Y1 H2)

More aggressive; adopters can rely on these.

| SLI | SLO Target | Error budget per 30 days |
|---|---|---|
| Ledger Post Availability | 99.95% | 21m 36s |
| Read API Availability | 99.99% | 4m 19s |
| Ledger Post p99 latency | < 300ms | - |
| Get Balance p99 | < 50ms | - |
| Decision Eval p99 | < 10ms | - |
| Webhook delivery within 30s | 99.9% | 43m 12s |
| Outbox publish lag p99 | < 1s | - |
| Payment durability (no in-flight loss) | 99.9999% | ≈2.6s (≈31.5s/year) |
| Ledger invariant compliance | 100% | 0 |
| Idempotency correctness | 100% | 0 |
| KYC verification turnaround p95 | < 10 min auto, < 24h manual | - |

v2.0 (high-availability target, Y2)

| SLI | SLO Target |
|---|---|
| Ledger Post Availability | 99.99% |
| Read API Availability | 99.999% (5 nines) |

Error budget policy

Each SLO has a 30-day rolling error budget.

Budget statuses

| Budget | Action |
|---|---|
| Healthy (>50% remaining) | Normal release cadence, all features ship |
| Cautious (10-50% remaining) | Code review demands explicit reliability discussion. New risky features delayed. |
| Frozen (<10% remaining) | Feature releases blocked. Only reliability fixes. Post-mortem required for budget consumption. |
| Exhausted (<0%) | Page on-call. All hands on reliability. Post-mortem mandatory. SLO temporarily downgraded to allow recovery. |
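
The status bands above map directly onto the fraction of budget remaining. A minimal sketch, assuming bad and total event counts for the rolling 30-day window are available; the function names are hypothetical and the boundary handling (exactly 50% counts as Cautious, exactly 0% as Frozen) is one reading of the table.

```python
def remaining_fraction(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    if total_events <= 0:
        return 1.0  # no traffic in the window, nothing consumed
    allowed_ratio = 1.0 - slo
    if allowed_ratio <= 0.0:
        raise ValueError("a 100% SLO has no error budget")
    consumed = (bad_events / total_events) / allowed_ratio
    return 1.0 - consumed


def budget_status(remaining: float) -> str:
    """Map remaining budget fraction to the policy status."""
    if remaining < 0.0:
        return "Exhausted"
    if remaining < 0.10:
        return "Frozen"
    if remaining <= 0.50:
        return "Cautious"
    return "Healthy"
```

For example, 80 failed requests out of 100,000 against a 99.9% SLO leaves about 20% of the budget, which lands in Cautious.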

This policy is published, applies to maintainers, and is reviewed quarterly.

Recovery from exhaustion

  1. Stop the bleeding (whatever caused the breach)
  2. Post-mortem within 48 hours
  3. Action items prioritized over feature work
  4. SLO target temporarily relaxed (not the SLI; we still measure honestly)
  5. Once budget is healthy for 7 consecutive days, restore SLO to original target
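
Step 5 can be checked mechanically. A minimal sketch, assuming one remaining-budget reading per day (oldest first) and taking "healthy" to mean more than 50% remaining, as in the budget statuses; the function name and defaults are hypothetical.

```python
from typing import Sequence


def can_restore_slo(
    daily_remaining: Sequence[float],
    healthy_above: float = 0.50,
    required_days: int = 7,
) -> bool:
    """True when the most recent `required_days` readings are all healthy."""
    if len(daily_remaining) < required_days:
        return False
    recent = daily_remaining[-required_days:]
    return all(r > healthy_above for r in recent)
```

A single unhealthy day inside the last seven restarts the count, because only the trailing window is inspected.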

Burn rate alerts

Per Google SRE workbook - multi-window, multi-burn-rate alerts.

For SLO 99.9% (error budget = 0.1%):

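| Burn rate | Detection window | Action |
|---|---|---|
| 14.4× | 1 hour | P0 - consumes 2% of the 30d budget per hour; burns the full budget in about 2 days |
| 6× | 6 hours | P1 - burns 30d budget in 5d |
| 1× | 3 days | P2 - burns 30d budget in 30d (track only) |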
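
The arithmetic behind a burn rate is simple: at burn rate B, a 30-day budget lasts 30 / B days, and a window of W hours consumes B × W / 720 of it. A minimal sketch with hypothetical function names:

```python
def hours_to_exhaustion(burn_rate: float, period_days: float = 30.0) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    if burn_rate <= 0.0:
        raise ValueError("burn_rate must be positive")
    return period_days * 24.0 / burn_rate


def budget_consumed(
    burn_rate: float, window_hours: float, period_days: float = 30.0
) -> float:
    """Fraction of the period's budget consumed during one window."""
    return burn_rate * window_hours / (period_days * 24.0)
```

At 14.4× the budget lasts about 50 hours and a 1-hour window consumes 2% of it; 6× over 6 hours consumes 5%, and 1× over 3 days consumes 10%.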

Implementation:

- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[1h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[1h]))
    ) > (14.4 * 0.001)   # 14.4× burn against 99.9% SLO
  for: 5m
  labels: { severity: P0, slo: ledger_post_availability }

- alert: AvailabilityBurnRateMedium
  expr: |
    (
      sum(rate(http_server_requests_total{application="ledger-service",status=~"5.."}[6h]))
      / sum(rate(http_server_requests_total{application="ledger-service"}[6h]))
    ) > (6 * 0.001)
  for: 30m
  labels: { severity: P1, slo: ledger_post_availability }
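
The rules above each evaluate one window plus a `for:` delay. The multi-window variant in the SRE workbook additionally requires a short window (for example 5m alongside 1h, or 30m alongside 6h) to exceed the same threshold, so the alert clears soon after errors stop. A minimal sketch of that condition in Python, with a hypothetical function name and the error ratios assumed to be computed elsewhere:

```python
def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    burn_rate_threshold: float,
    slo: float = 0.999,
) -> bool:
    """Fire only when both windows burn faster than the threshold."""
    threshold = burn_rate_threshold * (1.0 - slo)
    return (
        long_window_error_ratio > threshold
        and short_window_error_ratio > threshold
    )
```

With a 14.4× threshold against 99.9%, both windows must show an error ratio above 1.44%; a long window still elevated from an incident that has already ended does not fire on its own.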

SLO dashboard

Single Grafana dashboard with one panel per SLO:

  • Current SLI value
  • 30-day SLO target line
  • Error budget remaining (visualized as fuel gauge)
  • Burn rate over time
  • Recent SLO breaches

JSON in deploy/grafana/dashboards/slo.json.


Reporting cadence

| Audience | Cadence | Format |
|---|---|---|
| Maintainers (internal) | Weekly | Slack post with SLI numbers + budget status |
| Maintainer team / contributors | Monthly | Release notes / forum post (transparency) |
| Public adopters | Quarterly | Status page update |
| Regulatory (eventually) | Annual | SOC 2 audit deliverable |

What FinCore does NOT promise (yet)

  • No public SLA in OSS - adopters bring their own
  • No 24/7 on-call from upstream maintainers (community contracts only)
  • No multi-region active-active in v0.1 (RTO 30 min, RPO 1 sec via PITR)
  • No real-time fraud signal SLA - that's the adopter's responsibility through the RiskScorer plug-in
  • No data residency guarantee in v0.1 - single-region deployment

These show up in the roadmap as adopters demand them.


Adopter SLA template

For adopters defining SLAs with their own customers, this template encodes our SLOs as adopter-side baselines:

## Service Level Agreement (SLA)

We commit to the following service levels, measured monthly:

### Availability
- Ledger and balance APIs: 99.95% uptime
- Payment API: 99.95% uptime
- Decision and compliance APIs: 99.9% uptime

### Latency (p99)
- Get balance: < 200ms
- Post transaction: < 500ms
- Initiate payment: < 1s
- Decision evaluation: < 100ms

### Webhooks
- Delivered within 30 seconds: 99.9%
- Maximum delivery attempts: 7 over 7 days

### Compensation
- Service availability < 99.9% in a billing month: 10% credit
- Service availability < 99% in a billing month: 25% credit
- Service availability < 95% in a billing month: 50% credit

### Exclusions
- Scheduled maintenance windows (announced 7 days in advance)
- Force majeure
- Customer-controlled infrastructure (their bank, their KYC provider, their LLM)

Adopters customize and present to their customers.
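
The compensation tiers in the template reduce to a lookup on measured monthly availability. A minimal sketch, assuming the lowest matching tier applies and credits do not stack (the template does not say either way); the function name is hypothetical.

```python
def credit_percent(monthly_availability: float) -> int:
    """Service credit (% of monthly bill) for a measured availability ratio."""
    if not 0.0 <= monthly_availability <= 1.0:
        raise ValueError("availability must be a ratio in [0, 1]")
    if monthly_availability < 0.95:
        return 50
    if monthly_availability < 0.99:
        return 25
    if monthly_availability < 0.999:
        return 10
    return 0
```

Note that under these tiers a month at 99.92% misses the 99.95% commitment but earns no credit, which adopters may want to adjust.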


Private vertical layer

Private-vertical SLOs are documented privately. Not part of OSS commitments.


Related
