Operations Bundle

Tiana_ edited this page May 30, 2026 · 1 revision

Operations Bundle - Getting Started, Deployment, Helm, Monitoring, Runbook, Incident Response, FAQ, Troubleshooting

Operational Wiki pages bundled for compactness. Each section can be split into a standalone page if it grows.


§1. Getting Started

Prerequisites

  • Docker 24+ and Docker Compose v2
  • 8 GB RAM minimum (4 for the stack + 4 for your dev environment)
  • Java 21 (only if running services outside Docker)
  • curl, jq (for the demo script)

5-minute quickstart

git clone https://github.com/tiana-code/fincore-engine
cd fincore-engine
docker compose up -d

# Wait for services to be healthy (~30 seconds)
docker compose ps

# Run the demo
./scripts/demo.sh

The demo:

  1. Creates two accounts (USER_WALLET, EUR)
  2. Posts a 100 EUR transfer (double-entry transaction)
  3. Verifies balances match expected
  4. Demonstrates time-travel balance query
  5. Reverses the transaction
  6. Verifies idempotency (retry produces 1 transaction, not 2)
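
The behaviour the demo checks in steps 2 and 6 can be pictured with a toy in-memory model. This is only a sketch: the `Ledger` class, the account names, the idempotency key, and the sign convention (debit negative, credit positive) are assumptions for illustration, not the ledger service's actual API.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Ledger:
    """Toy in-memory ledger (hypothetical; the real service persists to Postgres)."""
    balances: dict = field(default_factory=dict)      # account id -> Decimal
    transactions: dict = field(default_factory=dict)  # idempotency key -> entries

    def post_transfer(self, key, debit, credit, amount):
        # A retry with the same key returns the stored result and posts nothing new.
        if key in self.transactions:
            return self.transactions[key]
        amount = Decimal(amount)
        entries = [(debit, -amount), (credit, amount)]
        # Double-entry invariant: the entries of one transaction sum to zero.
        assert sum(a for _, a in entries) == 0
        for account, delta in entries:
            self.balances[account] = self.balances.get(account, Decimal(0)) + delta
        self.transactions[key] = entries
        return entries


ledger = Ledger()
ledger.post_transfer("demo-key-1", "wallet-a", "wallet-b", "100.00")
ledger.post_transfer("demo-key-1", "wallet-a", "wallet-b", "100.00")  # retry
print(len(ledger.transactions), ledger.balances["wallet-b"])  # 1 100.00
```

The second call is a no-op because the key is already known, which is why the retry leaves one transaction rather than two.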

Endpoints

After docker compose up:

| Service | URL | Default credentials |
| --- | --- | --- |
| Ledger Service | http://localhost:8080 | get JWT from Keycloak |
| Swagger UI | http://localhost:8080/swagger-ui.html | - |
| Keycloak Admin | http://localhost:8081 | admin / admin |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Loki | http://localhost:3100 | - |
| Postgres | localhost:5432 | fincore / fincore |
| Redpanda Console | http://localhost:8888 | - |
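
Instead of eyeballing `docker compose ps`, the HTTP endpoints listed above could be polled until they answer. The helper below is a sketch using only the standard library; the function names, attempt count, and delay are assumptions, and "reachable" here means only that the port answers HTTP, not that the service reports itself healthy.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if the URL answers with any HTTP status at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server responded (e.g. 401/404), so it is reachable
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, or timeout


def wait_for(urls, probe=is_up, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll until every URL passes the probe; return the URLs still down."""
    pending = list(urls)
    for _ in range(attempts):
        pending = [u for u in pending if not probe(u)]
        if not pending:
            break
        sleep(delay)
    return pending


# Offline demonstration with a fake probe that succeeds on the third poll.
calls = {"n": 0}


def fake_probe(url):
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_for(["http://localhost:8080"], probe=fake_probe, sleep=lambda s: None))  # []
```

Against the running stack, a call such as `wait_for(["http://localhost:8080/swagger-ui.html", "http://localhost:3000"])` would return an empty list once both URLs respond.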

Get an access token

TOKEN=$(curl -s -X POST http://localhost:8081/realms/fincore/protocol/openid-connect/token \
  -d "grant_type=client_credentials" \
  -d "client_id=fincore-api-client" \
  -d "client_secret=demo-secret" \
  | jq -r '.access_token')

# Use it
curl http://localhost:8080/v1/accounts/some-id \
  -H "Authorization: Bearer $TOKEN"

Stop and clean up

docker compose down              # stops, keeps volumes
docker compose down -v           # stops + removes volumes (loses data)

§2. Deployment

Production deployment options

| Option | When to use |
| --- | --- |
| Helm chart on Kubernetes | Recommended - most adopters |
| Plain docker compose | Small / single-host POC |
| Manual JAR deployment | Air-gapped environments |
| Cloud-managed (EKS, GKE, AKS) | Enterprise - combine with Helm |

Helm chart (production)

# Add the chart repo
helm repo add fincore https://tiana-code.github.io/fincore-helm-charts
helm repo update

# Install with production values
helm install fincore-engine fincore/fincore-engine \
  --namespace fincore-engine --create-namespace \
  --values values-prod.yaml

# Upgrade (target the same namespace as the install)
helm upgrade fincore-engine fincore/fincore-engine \
  --namespace fincore-engine \
  --values values-prod.yaml

values-prod.yaml template:

global:
  environment: production
  image:
    registry: ghcr.io/tiana-code
    pullPolicy: IfNotPresent
  imagePullSecrets:
    - name: ghcr-cred

ledger:
  replicaCount: 3
  hpa:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    requests: { memory: "1Gi", cpu: "500m" }
    limits:   { memory: "2Gi", cpu: "2"   }
  config:
    spring:
      profiles:
        active: prod
      datasource:
        url: ${VAULT_DB_URL}
        username: ${VAULT_DB_USER}
        password: ${VAULT_DB_PASS}

postgres:
  enabled: false   # use external managed Postgres
  externalUrl: jdbc:postgresql://prod-postgres.example.com:5432/fincore

redpanda:
  enabled: false   # use external Strimzi Kafka or MSK
  bootstrapServers: kafka-bootstrap.example.com:9092

keycloak:
  enabled: false   # use external Keycloak
  externalIssuerUri: https://auth.example.com/realms/fincore

resilience:
  circuitBreaker:
    bankAdapter:
      failureRateThreshold: 50
  rateLimit:
    perIp: 100
    perUser: 1000

observability:
  prometheus:
    enabled: true
    serviceMonitor: true
  loki:
    enabled: true
  tempo:
    enabled: true

security:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
  networkPolicy:
    enabled: true
    egressAllowedTo:
      - postgres
      - kafka
      - keycloak
      - external-providers   # configure CIDRs

Production checklist

Before going live:

  • External Postgres (HA primary + replicas, PITR enabled, encryption at rest)
  • External Kafka (Strimzi / MSK / Confluent)
  • External Keycloak (HA + DB)
  • External Redis (HA)
  • TLS certs for ingress (cert-manager + Let's Encrypt or commercial)
  • mTLS in cluster (Istio or Linkerd recommended)
  • Network policies (default-deny + explicit allows)
  • Secrets via Vault / AWS SM / GCP SM (not env vars)
  • Image pull from your private registry (re-tag from ghcr.io/tiana-code)
  • Backup strategy: WAL archiving, nightly snapshots, quarterly DR drill
  • Monitoring + alerting wired (PagerDuty / Slack)
  • Runbook reviewed
  • Pen test scheduled
  • SOC 2 readiness assessment (if regulated)

Kubernetes manifests (alternative to Helm)

Raw manifests live in deploy/kubernetes/ for adopters who prefer kustomize or a Helm-free setup. They are maintained, but Helm is the primary path.


§3. Helm Chart

Chart structure

deploy/helm/fincore-engine/
├── Chart.yaml
├── values.yaml                  # defaults (Redpanda + Keycloak bundled, dev mode)
├── values-prod.yaml             # production overrides
├── values-kafka.yaml            # Apache Kafka via Strimzi instead of Redpanda
├── values-observability.yaml    # full Grafana stack
└── templates/
    ├── _helpers.tpl
    ├── ledger-deployment.yaml
    ├── ledger-service.yaml
    ├── ledger-configmap.yaml
    ├── ledger-secret.yaml
    ├── ledger-hpa.yaml
    ├── ledger-pdb.yaml          # PodDisruptionBudget
    ├── ledger-servicemonitor.yaml
    ├── ledger-networkpolicy.yaml
    ├── postgres-statefulset.yaml
    ├── postgres-service.yaml
    ├── redpanda-statefulset.yaml
    ├── keycloak-deployment.yaml
    ├── keycloak-service.yaml
    ├── ingress.yaml
    └── (per-service templates for v0.2+)

values.yaml (top level)

global:
  environment: dev
  image:
    registry: ghcr.io/tiana-code
    tag: 0.1.0
    pullPolicy: IfNotPresent

ledger:
  enabled: true
  replicaCount: 1
  service:
    type: ClusterIP
    port: 8080
  config:
    spring:
      profiles:
        active: dev

postgres:
  enabled: true
  persistence:
    size: 10Gi
  credentials:
    database: fincore
    username: fincore
    password: fincore

redpanda:
  enabled: true
  resources:
    requests: { memory: "512Mi", cpu: "250m" }
    limits:   { memory: "1Gi",   cpu: "1"    }

keycloak:
  enabled: true
  realmImport: /opt/keycloak/data/import/fincore-realm.json
  adminUser: admin
  adminPassword: admin    # CHANGE IN PROD

observability:
  enabled: false           # toggle for Grafana stack

Customization recipes

# Disable bundled Postgres, use external
helm install fincore fincore/fincore-engine \
  --set postgres.enabled=false \
  --set ledger.config.spring.datasource.url=jdbc:postgresql://prod-db:5432/fincore

# Use Apache Kafka instead of Redpanda
helm install fincore fincore/fincore-engine -f values-kafka.yaml

# Enable observability
helm install fincore fincore/fincore-engine -f values-observability.yaml

# Production deployment
helm install fincore fincore/fincore-engine -f values-prod.yaml

Chart testing

helm lint deploy/helm/fincore-engine
helm template deploy/helm/fincore-engine | kubectl apply --dry-run=client -f -
helm test fincore-engine                  # runs chart-tests

CI (.github/workflows/helm-test.yml) runs all of the above on every PR.


§4. Monitoring

Required dashboards

See Architecture-Observability for the full list. Minimal must-haves:

  1. Service Health Overview - RPS, error rate, p99 latency, pod count, heap, GC
  2. Ledger Throughput - transactions posted/sec, balance read p99, invariant compliance
  3. Outbox & Event Flow - pending events, dispatcher lag, consumer lag, DLQ depth
  4. Resilience - circuit breaker states, retry counts, saga interventions

Dashboard JSON files are in deploy/grafana/dashboards/.

Alert configuration

Defined as Prometheus rules in deploy/prometheus/rules.yaml. Severity matrix:

| Severity | Channel | Ack SLA |
| --- | --- | --- |
| P0 | PagerDuty + phone | 5 min |
| P1 | PagerDuty | 15 min |
| P2 | Slack #engineering | 1 hour |
| P3 | Slack #monitoring | next business day |

Each alert links to a runbook entry (see §5 below).
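
To make the ack SLAs concrete, the sketch below turns the severity matrix into a deadline calculation. The function name is hypothetical, and the reading of "next business day" as the end of the next Monday-to-Friday day is an assumption (holidays are ignored).

```python
from datetime import datetime, time, timedelta, timezone

# Ack SLAs copied from the severity matrix above.
ACK_SLA = {
    "P0": timedelta(minutes=5),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}


def ack_deadline(severity, fired_at):
    """Return the latest acceptable acknowledgement time for an alert."""
    if severity == "P3":
        # Assumption: "next business day" means end of the next Mon-Fri day.
        day = fired_at.date() + timedelta(days=1)
        while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            day += timedelta(days=1)
        return datetime.combine(day, time(23, 59, 59), tzinfo=fired_at.tzinfo)
    return fired_at + ACK_SLA[severity]


fired = datetime(2026, 6, 5, 10, 0, tzinfo=timezone.utc)  # a Friday
print(ack_deadline("P0", fired))  # 2026-06-05 10:05:00+00:00
print(ack_deadline("P3", fired))  # 2026-06-08 23:59:59+00:00
```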

Custom metrics

Each FinCore service exposes Micrometer metrics at /actuator/prometheus. Business metric naming:

  • ledger.transactions.posted.total (counter, labels: currency)
  • ledger.invariant.violation.total (counter - alert on any increase)
  • payments.completed.total (counter)
  • decision.evaluation.duration (histogram)
  • outbox.events.pending (gauge per schema)
  • webhook.delivery.success.total (counter)

Full catalog in Architecture-Observability§custom-counters--timers.
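
The dotted meter names above appear with underscores in Prometheus queries (for example `ledger_invariant_violation_total` in the runbook in §5). The helper below approximates that mapping for writing queries by hand. It is a simplified sketch, not Micrometer's actual naming code: unit suffixes that Micrometer may add to timers and histograms are not modelled.

```python
import re


def prometheus_name(meter_name, kind):
    """Approximate the exported name of a Micrometer meter (simplified)."""
    name = re.sub(r"[^a-zA-Z0-9_]", "_", meter_name)  # dots become underscores
    if kind == "counter" and not name.endswith("_total"):
        name += "_total"  # Prometheus counters conventionally end in _total
    return name


print(prometheus_name("ledger.invariant.violation.total", "counter"))
# ledger_invariant_violation_total
print(prometheus_name("outbox.events.pending", "gauge"))
# outbox_events_pending
```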


§5. Runbook

Index of runbook entries

Each runbook entry lives in runbooks/<topic>.md in the repo and is linked from alert annotations.

| Topic | Severity | When |
| --- | --- | --- |
| ledger-invariant-violation.md | P0 | Any invariant violation reported |
| service-down.md | P0 | Service health probe fails |
| outbox-backlog.md | P1 | Pending events > 1000 for 5 min |
| consumer-lag.md | P1 | Consumer lag > 10000 for 5 min |
| circuit-breaker-open.md | P1 | Bank/KYC adapter circuit OPEN > 2 min |
| dlq-non-zero.md | P2 | DLQ has messages |
| idempotency-conflict-rate.md | P2 | Conflict rate > 1% |
| dispatcher-failed.md | P0 | Outbox dispatcher cannot publish |
| db-connection-pool-exhausted.md | P1 | HikariCP saturation > 90% |
| db-disk-full.md | P0 | Postgres disk usage > 90% |
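
Several conditions above are of the form "value > threshold for N minutes". The sketch below shows one way to read that: the breach must hold without interruption for the whole window, and a single sample back under the threshold resets the clock. The function and sample data are illustrative assumptions; the real rules are the ones in deploy/prometheus/rules.yaml.

```python
def fires(samples, threshold, hold_seconds):
    """samples: (unix_ts, value) pairs in time order.

    Returns True once the value has stayed above the threshold for at least
    hold_seconds without dropping back.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared, start over
    return False


# Pending outbox events sampled once a minute; the dip at t=120 resets the window.
backlog = [(0, 1500), (60, 1600), (120, 800), (180, 1700), (240, 1800),
           (300, 1900), (360, 2000), (420, 2100), (480, 2200)]
print(fires(backlog, threshold=1000, hold_seconds=300))      # True
print(fires(backlog[:8], threshold=1000, hold_seconds=300))  # False
```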

Runbook template

# Runbook: <topic>

## Severity: <P0|P1|P2|P3> - <one-line description>

## What this means
<2-3 sentences explaining the alert>

## Detection
- Alert: <alert name>
- Metric: <prometheus query>
- Logs: <loki query>

## Immediate actions
1. <step 1>
2. <step 2>
3. <step 3>

## Investigation
- <hypotheses>

## Recovery
- <recovery steps>

## Post-incident
- <follow-up requirements>

Example: Ledger invariant violation runbook

# Runbook: Ledger Invariant Violation
## Severity: P0 - Customer-impacting, possible data corruption

## What this means
The deferred trigger detected entries that don't sum to zero per currency.
This should be impossible - investigate immediately.

## Detection
- Alert: LedgerInvariantViolation
- Metric: rate(ledger_invariant_violation_total[1m]) > 0
- Logs: {application="ledger-service", level="ERROR"} |~ "invariant"

## Immediate actions
1. Confirm the metric is real (cross-instance check)
2. Identify offending transaction:
   SELECT t.* FROM transactions t WHERE id IN (
     SELECT transaction_id FROM entries
     GROUP BY transaction_id, currency
     HAVING SUM(amount) <> 0
   );
3. If recent deploy: roll back. If not: page eng manager.

## Investigation
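- Was the trigger disabled or dropped? `\df+ verify_double_entry_invariant` confirms the function still exists; `\d entries` lists the table's triggers and marks disabled ones (assuming the trigger is attached to entries).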
- Was the materialized view stale?
- Was there a data import bypassing the trigger?

## Recovery
- DO NOT delete offending entries (immutable journal).
- Determine intended correct state.
- Post a compensating transaction if mathematically valid.
- Otherwise escalate to data-integrity working group.

## Post-incident
- Mandatory post-mortem within 48h.
- Root cause must cover how the invariant was bypassed and why tests did not catch it.
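
The SQL in the runbook's immediate actions groups entries by transaction and currency and keeps the groups whose amounts do not sum to zero. The same check can be sketched outside the database, for example against an exported batch. The column names follow the query; the function name and sample rows are made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def unbalanced(entries):
    """entries: iterable of (transaction_id, currency, amount) rows.

    Returns {(transaction_id, currency): total} for every group whose
    amounts do not sum to zero.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    for tx_id, currency, amount in entries:
        totals[(tx_id, currency)] += Decimal(amount)
    return {key: total for key, total in totals.items() if total != 0}


rows = [
    ("t1", "EUR", "-100.00"), ("t1", "EUR", "100.00"),  # balanced
    ("t2", "EUR", "-50.00"), ("t2", "EUR", "49.99"),    # off by 0.01
]
print(unbalanced(rows))  # {('t2', 'EUR'): Decimal('-0.01')}
```

Amounts are parsed as `Decimal` rather than `float` so that a one-cent discrepancy is not masked or invented by binary rounding.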

§6. Incident Response

Incident lifecycle

DETECTED → TRIAGED → DIAGNOSED → MITIGATED → RESOLVED → POST-MORTEM

Roles during an incident

  • Incident Commander (IC) - coordinates, makes decisions, communicates
  • Subject Matter Expert (SME) - debugs, applies fixes
  • Communicator - keeps stakeholders informed (status page, Slack, email)
  • Scribe - captures timeline, decisions, action items

A solo maintainer fills all four roles. Larger teams assign each role to a separate person.

Communication during P0/P1

| Audience | Channel | Cadence |
| --- | --- | --- |
| Internal (maintainers, on-call) | Slack #incidents | Continuous |
| Public (status page) | status.fincore.dev | Every 30 min |
| GitHub Discussions | "Incident" category | At start, mid, end |
| Sponsors (if affected) | Email | At start, end |
| Adopters (paid) | Email + status page | Continuous |

Post-mortem template

# Post-mortem: <incident title>

## Date / time
- Detected: ...
- Resolved: ...
- Duration: ...

## Severity
- P0/P1/P2

## Impact
- Customer-facing: ...
- Internal: ...
- Data: ...

## Timeline (UTC)
- HH:MM - <what happened>
- HH:MM - <what we noticed>
- HH:MM - <what we tried>
- HH:MM - <what fixed it>

## Root cause
- <single sentence>
- <detailed analysis>

## What went well
- ...

## What went poorly
- ...

## Action items (owner + ETA)
- [ ] <action> - Maintainer - 2026-MM-DD
- [ ] <action> - ...

## Lessons learned
- ...

Every P0/P1 incident gets a post-mortem within 48 hours. Post-mortems for non-sensitive incidents are published, for transparency.

Blameless culture

  • Focus on what failed, not who failed
  • Mistakes are systemic problems, not individual ones
  • Action items must be process or tooling, not "be more careful"

§7. FAQ

General

Q: Is FinCore Engine a fintech? No. It's open-source infrastructure for building fintech apps. We don't hold money, process payments, or have regulatory presence.

Q: Do I need a license to use FinCore Engine? No, not for non-production use, evaluation, contributions, or embedding it in your own product: BSL 1.1 grants free use for all of these. Only offering a competing managed service requires a commercial license. See ADR-0002.

Q: When does my use convert to Apache 2.0? Each release auto-converts 4 years after its release date. v0.1.0 (June 2026) → Apache 2.0 in June 2030.
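
The conversion rule is plain date arithmetic, sketched below. The function name is hypothetical and the release day is a placeholder, since only the month (June 2026) is stated; the change date written in each release's license text is what actually governs.

```python
from datetime import date


def apache_conversion_date(release_date, years=4):
    """Date on which a release converts, a fixed number of years after it shipped."""
    try:
        return release_date.replace(year=release_date.year + years)
    except ValueError:
        # Released on Feb 29 and the target year is not a leap year.
        return release_date.replace(year=release_date.year + years, day=28)


# The exact v0.1.0 release day is not stated; 2026-06-15 is a placeholder.
print(apache_conversion_date(date(2026, 6, 15)))  # 2030-06-15
```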

Q: Can I fork it? Yes. Per the BSL, you can fork, modify, and use freely under the same terms.

Q: Is FinCore Engine production-ready? v0.1.0 is an MVP, suitable for production only with low-risk workloads. v1.0.0 (target Q2 2027) is the production-stable milestone, with public SLOs and a SOC 2 readiness checklist.

Technical

Q: Can I use my own database (not Postgres)? v0.1.0 is Postgres-only. v0.3.0 introduces a TigerBeetle adapter. Other databases are out of scope and not supported.

Q: Can I use Apache Kafka instead of Redpanda? Yes. The Kafka client code is the same, and the Helm chart ships values-kafka.yaml. Point bootstrap-servers at your cluster.

Q: Can I run without Keycloak? Yes. Use any OIDC-compatible provider (Auth0, Okta, Cognito, internal). Just change issuer-uri.

Q: What's the difference between Decision Engine and Drools? Decision Engine is JSON-DSL, deterministic, audit-first, lightweight (Maven JAR), Kotlin-native. Drools is .drl-DSL, BRMS-heavy, JBoss-coupled. See ADR-0008 for the full rationale.

Q: How do I add a new payment provider? Implement the BankProvider interface, register the implementation as a Spring bean, and configure it in application.yml. See services/payment/.../external/SandboxBankAdapter.kt for a reference implementation.

Q: How do I add ML? Implement the RiskScorer or AnomalyDetector interface, register the implementation as a Spring bean, and configure it. ML models stay private; only the interfaces are part of the open-source code.

Q: Can I use FinCore Engine for crypto? Technically yes, but there is no mature support. The ledger stores amounts as NUMERIC(38,18), which is fixed precision: 38 significant digits, 18 of them after the decimal point. Custodial wallets, KYC for crypto, and on-chain interaction are out of scope.
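
To check whether a given amount fits the NUMERIC(38,18) column exactly, a pre-flight test along these lines could be used. It is a sketch with a hypothetical function name. Postgres rounds extra fractional digits to the declared scale and raises an error when the integer part overflows, so the check flags both cases.

```python
from decimal import Decimal, localcontext

SCALE = 18                            # digits after the decimal point
MAX_ABS = Decimal(10) ** (38 - 18)    # integer part may use at most 20 digits


def fits_numeric_38_18(value):
    """True if value can be stored in NUMERIC(38,18) without rounding or overflow."""
    d = Decimal(value)
    if not d.is_finite():
        return False
    if d.copy_abs() >= MAX_ABS:
        return False  # would overflow the column
    with localcontext() as ctx:
        ctx.prec = 80  # enough working precision for the quantize below
        return d == d.quantize(Decimal(1).scaleb(-SCALE))  # no digits lost at scale 18


print(fits_numeric_38_18("0.000000000000000001"))   # True  (18 decimal places)
print(fits_numeric_38_18("0.0000000000000000001"))  # False (19 decimal places)
print(fits_numeric_38_18("100000000000000000000"))  # False (21 integer digits)
```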

Q: Does FinCore handle FX (currency conversion)? No. Each FinCore transaction is single-currency. For a cross-currency payment, your bank provider does the conversion and FinCore tracks both sides as separate transactions.
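
One way to picture "both sides as separate transactions" is sketched below. Everything here is hypothetical: the account names, the dictionary shape, and the 108.50 figure, which stands in for whatever amount the bank provider reports after converting. The point is only that each leg balances in its own currency and no exchange rate is applied on the FinCore side.

```python
from decimal import Decimal


def cross_currency_transactions(src_amount, src_currency, dst_amount, dst_currency):
    """Build two single-currency transactions for one converted payment.

    dst_amount is supplied by the bank provider; nothing is converted here.
    """
    src, dst = Decimal(src_amount), Decimal(dst_amount)
    return [
        {"currency": src_currency,
         "entries": [("customer-" + src_currency, -src), ("provider-" + src_currency, src)]},
        {"currency": dst_currency,
         "entries": [("provider-" + dst_currency, -dst), ("customer-" + dst_currency, dst)]},
    ]


for tx in cross_currency_transactions("100.00", "EUR", "108.50", "USD"):
    total = sum(amount for _, amount in tx["entries"])
    print(tx["currency"], total)  # each leg nets to zero in its own currency
```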

Roadmap

Q: When is v1.0? Q2 2027. See Roadmap.

Q: Will there be a hosted SaaS? Not in 2026. Maybe in 2028+ depending on signal. Self-hosted is the supported path for the foreseeable future.

Q: Can I influence the roadmap? Yes - issues, discussions, sponsorship at higher tiers. See Roadmap#how-priorities-are-decided.

Contributing

Q: Are PRs welcome? Yes. See CONTRIBUTING.md. During the first year (Y1), bug reports and integration testing are more valuable than feature PRs.

Q: Do I need to sign a CLA? For substantial PRs, yes - standard for BSL projects. The CLA grants the project the right to relicense your contribution as part of the project (necessary for the BSL → Apache 2.0 auto-conversion to work).

Q: Can I report security issues? Yes. See SECURITY.md: reports are acknowledged within 48 hours, and HIGH/CRITICAL issues have a 30-day fix SLA.


§8. Troubleshooting

Common issues

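docker compose up fails: port already in use
Another service is using port 8080, 5432, 9092, 8081, or 3000. Stop it, or remap the port. Compose appends port lists from override files by default, so the override needs the !override tag (Docker Compose 2.24+) to replace the original mapping:

# docker-compose.override.yml (auto-loaded)
services:
  ledger:
    # !override replaces the base port list instead of appending to it
    ports: !override
      - "18080:8080"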

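Ledger service won't start: "Liquibase lock"
A previous startup was interrupted and left the changelog lock set. Make sure no other instance is still running migrations, then clear it:

docker compose exec postgres psql -U fincore -d fincore -c "UPDATE databasechangeloglock SET locked = FALSE, lockgranted = NULL, lockedby = NULL WHERE id = 1;"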

Keycloak auth: "invalid_client"
The default sandbox client secret is demo-secret. Production needs its own secret.

Kafka producer: "TimeoutException"
Redpanda is not healthy yet. Wait 30 seconds. If the error persists, check docker compose logs redpanda for errors.

Test fails: "Container failed to start"
Testcontainers can't pull images. Ensure Docker is running, then pull the images manually:

docker pull postgres:17-alpine
docker pull redpandadata/redpanda:v24.3.1

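./gradlew build: out of memory
Increase the Gradle daemon heap. GRADLE_OPTS applies to the Gradle client JVM; the daemon that runs the build takes its heap from org.gradle.jvmargs:

./gradlew build -Dorg.gradle.jvmargs=-Xmx4g

To make the setting permanent, add org.gradle.jvmargs=-Xmx4g to gradle.properties.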

Hibernate: LazyInitializationException
A lazy collection was accessed outside a transaction. Use @EntityGraph or fetch eagerly. See Code-Rules§5.

docker compose down -v lost my data
Yes, -v removes volumes. Don't use it unless you intend to delete the data.

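MapStruct: "Mapper not generated"
MapStruct is a Java annotation processor, so in a Kotlin build it runs through kapt rather than KSP. Check that build.gradle.kts applies the kotlin("kapt") plugin and declares kapt("org.mapstruct:mapstruct-processor:1.6.3"), then run ./gradlew kaptKotlin.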

Demo script fails: "jq not found"
Install jq: brew install jq (macOS) or apt install jq (Debian/Ubuntu).

Where to get help

  1. Check this troubleshooting page
  2. Search GitHub Discussions
  3. Open a new Discussion (preferred) or Issue (for bugs)
  4. For commercial inquiries: email per SECURITY.md
