Operations Bundle

Operations Bundle - Getting Started, Deployment, Helm, Monitoring, Runbook, Incident Response, FAQ, Troubleshooting

Operational Wiki pages bundled for compactness. Each section can be split into a standalone page if it grows.

§1. Getting Started

Prerequisites

Docker 24+ and Docker Compose v2
8 GB RAM minimum (4 for the stack + 4 for your dev environment)
Java 21 (only if running services outside Docker)
curl, jq (for the demo script)

5-minute quickstart

git clone https://github.com/tiana-code/fincore-engine
cd fincore-engine
docker compose up -d

# Wait for services to be healthy (~30 seconds)
docker compose ps

# Run the demo
./scripts/demo.sh

The demo:

Creates two accounts (USER_WALLET, EUR)
Posts a 100 EUR transfer (double-entry transaction)
Verifies balances match expected
Demonstrates time-travel balance query
Reverses the transaction
Verifies idempotency (retry produces 1 transaction, not 2)
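
The last three demo steps rest on two rules: a transaction's entries must sum to zero per currency, and a retried request with the same idempotency key must not post twice. The sketch below illustrates both with a hypothetical in-memory `Ledger` class; the class, its method names, and the account names are assumptions for illustration, not the FinCore API.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    account: str
    currency: str
    amount: Decimal  # negative = money leaves the account, positive = money arrives


@dataclass
class Ledger:
    """Hypothetical in-memory ledger, for illustration only."""
    transactions: dict = field(default_factory=dict)  # idempotency key -> entries

    def post(self, idempotency_key: str, entries: list) -> list:
        # A retry with a known key returns the stored result instead of posting again.
        if idempotency_key in self.transactions:
            return self.transactions[idempotency_key]
        # Double-entry invariant: entries must sum to zero per currency.
        totals = {}
        for e in entries:
            totals[e.currency] = totals.get(e.currency, Decimal(0)) + e.amount
        if any(total != 0 for total in totals.values()):
            raise ValueError(f"unbalanced transaction: {totals}")
        self.transactions[idempotency_key] = list(entries)
        return self.transactions[idempotency_key]

    def reverse(self, original_key: str, reversal_key: str) -> list:
        # A reversal is a new transaction with every amount negated;
        # the original entries are never edited or deleted.
        mirrored = [Entry(e.account, e.currency, -e.amount)
                    for e in self.transactions[original_key]]
        return self.post(reversal_key, mirrored)

    def balance(self, account: str, currency: str) -> Decimal:
        return sum(
            (e.amount
             for entries in self.transactions.values()
             for e in entries
             if e.account == account and e.currency == currency),
            Decimal(0),
        )


ledger = Ledger()
transfer = [Entry("alice", "EUR", Decimal("-100")), Entry("bob", "EUR", Decimal("100"))]
ledger.post("transfer-1", transfer)
ledger.post("transfer-1", transfer)  # retry with the same key: still one transaction
```

Calling `ledger.reverse("transfer-1", "reversal-1")` afterwards brings both balances back to zero while keeping both transactions in the journal.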

Endpoints

After docker compose up:

Service	URL	Default credentials
Ledger Service	http://localhost:8080	get JWT from Keycloak
Swagger UI	http://localhost:8080/swagger-ui.html	-
Keycloak Admin	http://localhost:8081	admin / admin
Grafana	http://localhost:3000	admin / admin
Prometheus	http://localhost:9090	-
Loki	http://localhost:3100	-
Postgres	localhost:5432	fincore / fincore
Redpanda Console	http://localhost:8888	-

Get an access token

TOKEN=$(curl -s -X POST http://localhost:8081/realms/fincore/protocol/openid-connect/token \
  -d "grant_type=client_credentials" \
  -d "client_id=fincore-api-client" \
  -d "client_secret=demo-secret" \
  | jq -r '.access_token')

# Use it
curl http://localhost:8080/v1/accounts/some-id \
  -H "Authorization: Bearer $TOKEN"
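
For scripts that outgrow curl, the same client-credentials flow can be sketched with the Python standard library. The URL, client ID, and secret are the sandbox values from the commands above; the function names are hypothetical helpers, not part of FinCore.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "http://localhost:8081/realms/fincore/protocol/openid-connect/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the same form-encoded POST that the curl command sends."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    # With a data payload, urllib sends application/x-www-form-urlencoded, like curl -d.
    return urllib.request.Request(TOKEN_URL, data=form, method="POST")


def fetch_token(client_id: str, client_secret: str) -> str:
    """Call Keycloak and return the access token (needs the stack to be running)."""
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["access_token"]


def authorized_request(url: str, token: str) -> urllib.request.Request:
    """Attach the bearer token to a request against the Ledger Service."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

With the stack up, `fetch_token("fincore-api-client", "demo-secret")` should return the same token string that the `jq` pipeline extracts.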

Stop and clean up

docker compose down              # stops, keeps volumes
docker compose down -v           # stops + removes volumes (loses data)

§2. Deployment

Production deployment options

Option	When to use
Helm chart on Kubernetes	Recommended - most adopters
Plain `docker compose`	Small / single-host POC
Manual JAR deployment	Air-gapped environments
Cloud-managed (EKS, GKE, AKS)	Enterprise - combine with Helm

Helm chart (production)

# Add the chart repo
helm repo add fincore https://tiana-code.github.io/fincore-helm-charts
helm repo update

# Install with production values
helm install fincore-engine fincore/fincore-engine \
  --namespace fincore-engine --create-namespace \
  --values values-prod.yaml

# Upgrade
helm upgrade fincore-engine fincore/fincore-engine \
  --namespace fincore-engine \
  --values values-prod.yaml

values-prod.yaml template:

global:
  environment: production
  image:
    registry: ghcr.io/tiana-code
    pullPolicy: IfNotPresent
  imagePullSecrets:
    - name: ghcr-cred

ledger:
  replicaCount: 3
  hpa:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    requests: { memory: "1Gi", cpu: "500m" }
    limits:   { memory: "2Gi", cpu: "2"   }
  config:
    spring:
      profiles:
        active: prod
      datasource:
        url: ${VAULT_DB_URL}
        username: ${VAULT_DB_USER}
        password: ${VAULT_DB_PASS}

postgres:
  enabled: false   # use external managed Postgres
  externalUrl: jdbc:postgresql://prod-postgres.example.com:5432/fincore

redpanda:
  enabled: false   # use external Strimzi Kafka or MSK
  bootstrapServers: kafka-bootstrap.example.com:9092

keycloak:
  enabled: false   # use external Keycloak
  externalIssuerUri: https://auth.example.com/realms/fincore

resilience:
  circuitBreaker:
    bankAdapter:
      failureRateThreshold: 50
  rateLimit:
    perIp: 100
    perUser: 1000

observability:
  prometheus:
    enabled: true
    serviceMonitor: true
  loki:
    enabled: true
  tempo:
    enabled: true

security:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
  networkPolicy:
    enabled: true
    egressAllowedTo:
      - postgres
      - kafka
      - keycloak
      - external-providers   # configure CIDRs
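
To get a feel for what the `hpa` block in this template implies, the sketch below applies the scaling formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are ignored here. The function name is hypothetical; the defaults are the values from the template.

```python
import math


def desired_replicas(current: int, cpu_utilization_pct: float,
                     target_pct: float = 70,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Approximate the HPA decision for average CPU utilization."""
    raw = math.ceil(current * cpu_utilization_pct / target_pct)
    # The autoscaler never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 90))   # load above target: scale out
print(desired_replicas(3, 35))   # load below target: held at minReplicas
print(desired_replicas(8, 140))  # heavy load: capped at maxReplicas
```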

Production checklist

Before going live:

Kubernetes manifests (alternative to Helm)

Raw manifests live in deploy/kubernetes/ for adopters who prefer kustomize or a no-Helm workflow. They are maintained, but Helm is the primary path.

§3. Helm Chart

Chart structure

deploy/helm/fincore-engine/
├── Chart.yaml
├── values.yaml                  # defaults (Redpanda + Keycloak bundled, dev mode)
├── values-prod.yaml             # production overrides
├── values-kafka.yaml            # Apache Kafka via Strimzi instead of Redpanda
├── values-observability.yaml    # full Grafana stack
└── templates/
    ├── _helpers.tpl
    ├── ledger-deployment.yaml
    ├── ledger-service.yaml
    ├── ledger-configmap.yaml
    ├── ledger-secret.yaml
    ├── ledger-hpa.yaml
    ├── ledger-pdb.yaml          # PodDisruptionBudget
    ├── ledger-servicemonitor.yaml
    ├── ledger-networkpolicy.yaml
    ├── postgres-statefulset.yaml
    ├── postgres-service.yaml
    ├── redpanda-statefulset.yaml
    ├── keycloak-deployment.yaml
    ├── keycloak-service.yaml
    ├── ingress.yaml
    └── (per-service templates for v0.2+)

values.yaml (top level)

global:
  environment: dev
  image:
    registry: ghcr.io/tiana-code
    tag: 0.1.0
    pullPolicy: IfNotPresent

ledger:
  enabled: true
  replicaCount: 1
  service:
    type: ClusterIP
    port: 8080
  config:
    spring:
      profiles:
        active: dev

postgres:
  enabled: true
  persistence:
    size: 10Gi
  credentials:
    database: fincore
    username: fincore
    password: fincore

redpanda:
  enabled: true
  resources:
    requests: { memory: "512Mi", cpu: "250m" }
    limits:   { memory: "1Gi",   cpu: "1"    }

keycloak:
  enabled: true
  realmImport: /opt/keycloak/data/import/fincore-realm.json
  adminUser: admin
  adminPassword: admin    # CHANGE IN PROD

observability:
  enabled: false           # toggle for Grafana stack

Customization recipes

# Disable bundled Postgres, use external
helm install fincore fincore/fincore-engine \
  --set postgres.enabled=false \
  --set ledger.config.spring.datasource.url=jdbc:postgresql://prod-db:5432/fincore

# Use Apache Kafka instead of Redpanda
helm install fincore fincore/fincore-engine -f values-kafka.yaml

# Enable observability
helm install fincore fincore/fincore-engine -f values-observability.yaml

# Production deployment
helm install fincore fincore/fincore-engine -f values-prod.yaml

Chart testing

helm lint deploy/helm/fincore-engine
helm template deploy/helm/fincore-engine | kubectl apply --dry-run=client -f -
helm test fincore-engine                  # runs chart-tests

CI (.github/workflows/helm-test.yml) runs all of the above on every PR.

§4. Monitoring

Required dashboards

See Architecture-Observability for the full list. Minimal must-haves:

Service Health Overview - RPS, error rate, p99 latency, pod count, heap, GC
Ledger Throughput - transactions posted/sec, balance read p99, invariant compliance
Outbox & Event Flow - pending events, dispatcher lag, consumer lag, DLQ depth
Resilience - circuit breaker states, retry counts, saga interventions

Dashboard JSONs in deploy/grafana/dashboards/.

Alert configuration

Defined as Prometheus rules in deploy/prometheus/rules.yaml. Severity matrix:

Severity	Channel	Ack SLA
P0	PagerDuty + phone	5 min
P1	PagerDuty	15 min
P2	Slack #engineering	1 hour
P3	Slack #monitoring	next business day

Each alert links to a runbook entry (see §5 below).

Custom metrics

Each FinCore service exposes Micrometer metrics at /actuator/prometheus. Business metric naming:

ledger.transactions.posted.total (counter, labels: currency)
ledger.invariant.violation.total (counter - alert on any > 0)
payments.completed.total (counter)
decision.evaluation.duration (histogram)
outbox.events.pending (gauge per schema)
webhook.delivery.success.total (counter)

Full catalog in Architecture-Observability§custom-counters--timers.
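
The names above are Micrometer names. Micrometer's Prometheus naming convention replaces dots with underscores and ensures counters end in `_total`, which is why the runbook query in §5 uses `ledger_invariant_violation_total`. A minimal sketch of that mapping for counters and gauges follows; the function is a hypothetical helper that ignores other sanitizing rules and the unit suffixes added to timers.

```python
def to_prometheus_name(micrometer_name: str, is_counter: bool = False) -> str:
    """Approximate the Prometheus series name for a Micrometer counter or gauge."""
    name = micrometer_name.replace(".", "_")
    # Counters are exposed with a _total suffix; add it only if it is missing.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(to_prometheus_name("ledger.invariant.violation.total", is_counter=True))
print(to_prometheus_name("outbox.events.pending"))
```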

§5. Runbook

Index of runbook entries

Each runbook entry lives in runbooks/<topic>.md of the repo and is linked from alert annotations.

Topic	Severity	When
`ledger-invariant-violation.md`	P0	Any invariant violation reported
`service-down.md`	P0	Service health probe fails
`outbox-backlog.md`	P1	Pending events > 1000 for 5 min
`consumer-lag.md`	P1	Consumer lag > 10000 for 5 min
`circuit-breaker-open.md`	P1	Bank/KYC adapter circuit OPEN > 2 min
`dlq-non-zero.md`	P2	DLQ has messages
`idempotency-conflict-rate.md`	P2	Conflict rate > 1%
`dispatcher-failed.md`	P0	Outbox dispatcher cannot publish
`db-connection-pool-exhausted.md`	P1	HikariCP saturation > 90%
`db-disk-full.md`	P0	Postgres disk usage > 90%
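
Several rows use a "threshold for N minutes" condition, which in Prometheus rules is normally expressed with a `for:` clause: the alert fires only if the condition has held at every evaluation for the whole window. The sketch below shows that logic on a list of samples. It is an illustration with a hypothetical function name, not the rule evaluation code.

```python
def breached_for(samples, threshold: float, hold_seconds: int) -> bool:
    """Return True if the latest run of samples above threshold spans hold_seconds.

    samples: list of (unix_seconds, value) tuples sorted by time,
    one per rule evaluation.
    """
    breach_started = None
    last_ts = None
    for ts, value in samples:
        last_ts = ts
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # condition just became true
        else:
            breach_started = None    # any dip resets the window
    return breach_started is not None and last_ts - breach_started >= hold_seconds


# outbox-backlog: pending events > 1000 for 5 min, evaluated once a minute
steady = [(t, 1500) for t in range(0, 301, 60)]
dipped = [(t, 900 if t == 120 else 1500) for t in range(0, 301, 60)]
print(breached_for(steady, 1000, 300), breached_for(dipped, 1000, 300))
```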

Runbook template

# Runbook: <topic>

## Severity: <P0-P3> - <one-line description>

## What this means
<2-3 sentences explaining the alert>

## Detection
- Alert: <alert name>
- Metric: <prometheus query>
- Logs: <loki query>

## Immediate actions
1. <step 1>
2. <step 2>
3. <step 3>

## Investigation
- <hypotheses>

## Recovery
- <recovery steps>

## Post-incident
- <follow-up requirements>

Example: Ledger invariant violation runbook

# Runbook: Ledger Invariant Violation
## Severity: P0 - Customer-impacting, possible data corruption

## What this means
The deferred trigger detected entries that don't sum to zero per currency.
This should be impossible - investigate immediately.

## Detection
- Alert: LedgerInvariantViolation
- Metric: rate(ledger_invariant_violation_total[1m]) > 0
- Logs: {application="ledger-service", level="ERROR"} |~ "invariant"

## Immediate actions
1. Confirm the metric is real (cross-instance check)
2. Identify offending transaction:
   SELECT t.* FROM transactions t WHERE id IN (
     SELECT transaction_id FROM entries
     GROUP BY transaction_id, currency
     HAVING SUM(amount) <> 0
   );
3. If recent deploy: roll back. If not: page eng manager.

## Investigation
- Was the trigger disabled? `\df+ verify_double_entry_invariant`
- Was the materialized view stale?
- Was there a data import bypassing the trigger?

## Recovery
- DO NOT delete offending entries (immutable journal).
- Determine intended correct state.
- Post a compensating transaction if mathematically valid.
- Otherwise escalate to data-integrity working group.

## Post-incident
- Mandatory post-mortem within 48h.
- Root cause must include: how was invariant bypassed, why didn't tests catch.
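
To rehearse the "identify offending transaction" step without touching a real database, the query from the runbook can be run against a throwaway SQLite database. The schema below is a guess reduced to the columns the query needs, and amounts are stored as integer minor units; the production Postgres schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id TEXT PRIMARY KEY);
    CREATE TABLE entries (
        transaction_id TEXT REFERENCES transactions(id),
        currency TEXT,
        amount INTEGER  -- minor units, so sums stay exact in SQLite
    );
    INSERT INTO transactions VALUES ('tx-ok'), ('tx-bad');
    INSERT INTO entries VALUES
        ('tx-ok',  'EUR', -10000), ('tx-ok',  'EUR', 10000),
        ('tx-bad', 'EUR', -10000), ('tx-bad', 'EUR',  9900);
""")

# Same shape as the runbook query: any (transaction, currency) group
# whose entries do not sum to zero marks the transaction as offending.
OFFENDING = """
    SELECT t.id FROM transactions t WHERE t.id IN (
        SELECT transaction_id FROM entries
        GROUP BY transaction_id, currency
        HAVING SUM(amount) <> 0
    )
"""
offending_ids = [row[0] for row in conn.execute(OFFENDING)]
print(offending_ids)
```

Only `tx-bad` is reported, because its two EUR entries leave a remainder of 100 minor units.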

§6. Incident Response

Incident lifecycle

DETECTED → TRIAGED → DIAGNOSED → MITIGATED → RESOLVED → POST-MORTEM

Roles during an incident

Incident Commander (IC) - coordinates, makes decisions, communicates
Subject Matter Expert (SME) - debugs, applies fixes
Communicator - keeps stakeholders informed (status page, Slack, email)
Scribe - captures timeline, decisions, action items

A solo maintainer fills all four roles. Larger teams should assign them to separate people.

Communication during P0/P1

Audience	Channel	Cadence
Internal (maintainers, on-call)	Slack #incidents	Continuous
Public (status page)	status.fincore.dev	Every 30 min
GitHub Discussions	"Incident" category	At start, mid, end
Sponsors (if affected)	Email	At start, end
Adopters (paid)	Email + status page	Continuous

Post-mortem template

# Post-mortem: <incident title>

## Date / time
- Detected: ...
- Resolved: ...
- Duration: ...

## Severity
- P0/P1/P2

## Impact
- Customer-facing: ...
- Internal: ...
- Data: ...

## Timeline (UTC)
- HH:MM - <what happened>
- HH:MM - <what we noticed>
- HH:MM - <what we tried>
- HH:MM - <what fixed it>

## Root cause
- <single sentence>
- <detailed analysis>

## What went well
- ...

## What went poorly
- ...

## Action items (owner + ETA)
- [ ] <action> - Maintainer - 2026-MM-DD
- [ ] <action> - ...

## Lessons learned
- ...

Every P0/P1 gets a post-mortem within 48 hours. Post-mortems for non-sensitive incidents are published for transparency.

Blameless culture

Focus on what failed, not who failed
Mistakes are systemic problems, not individual ones
Action items must be process or tooling, not "be more careful"

§7. FAQ

General

Q: Is FinCore Engine a fintech?
No. It is open-source infrastructure for building fintech apps. We don't hold money, process payments, or have a regulatory presence.

Q: Do I need a license to use FinCore Engine?
Not for non-production use, embedded use in your own product, evaluation, or contributions: BSL 1.1 grants free use for all of these. Only a "competing managed service" requires a commercial license. See ADR-0002.

Q: When does my use convert to Apache 2.0?
Each release converts automatically 4 years after its release date. For example, v0.1.0 (June 2026) becomes Apache 2.0 in June 2030.

Q: Can I fork it?
Yes. Per the BSL, you can fork, modify, and use it freely under the same terms.

Q: Is FinCore Engine production-ready?
v0.1.0 is an MVP, suitable for production only on low-risk workloads. v1.0.0 (target Q2 2027) is the production-stable milestone, with public SLOs and a SOC 2 readiness checklist.

Technical

Q: Can I use my own database (not Postgres)?
v0.1.0 is Postgres-only. v0.3.0 introduces a TigerBeetle adapter. Other databases are out of scope and not supported.

Q: Can I use Apache Kafka instead of Redpanda?
Yes. The Kafka client code is the same, and the Helm chart ships values-kafka.yaml. Point bootstrap-servers at your cluster.

Q: Can I run without Keycloak?
Yes. Use any OIDC-compatible provider (Auth0, Okta, Cognito, internal) and change issuer-uri.

Q: What's the difference between Decision Engine and Drools?
Decision Engine uses a JSON DSL and is deterministic, audit-first, lightweight (a Maven JAR), and Kotlin-native. Drools uses the .drl DSL, is BRMS-heavy, and is coupled to JBoss. See ADR-0008 for the full rationale.

Q: How do I add a new payment provider?
Implement the BankProvider interface, register it as a Spring bean, and configure it in application.yml. See services/payment/.../external/SandboxBankAdapter.kt as a reference.

Q: How do I add ML?
Implement the RiskScorer or AnomalyDetector interface, register it as a Spring bean, and configure it. ML models stay private; only the interfaces are in the OSS repo.

Q: Can I use FinCore Engine for crypto?
Not with mature support. The ledger stores high-precision amounts (NUMERIC(38,18)), so it is technically possible, but custodial wallets, KYC for crypto, and on-chain interaction are out of scope.

Q: Does FinCore handle FX (currency conversion)?
No. FinCore is single-currency per transaction. For cross-currency flows, your bank provider does the conversion and FinCore tracks both sides as separate transactions.

Roadmap

Q: When is v1.0?
The target is Q2 2027. See Roadmap.

Q: Will there be a hosted SaaS?
Not in 2026. Possibly in 2028 or later, depending on demand. Self-hosted is the supported path for the foreseeable future.

Q: Can I influence the roadmap?
Yes, through issues, discussions, and sponsorship at higher tiers. See Roadmap#how-priorities-are-decided.

Contributing

Q: Are PRs welcome?
Yes. See CONTRIBUTING.md. During the first year (Y1), bug reports and integration testing are more valuable than feature PRs.

Q: Do I need to sign a CLA?
For substantial PRs, yes. This is standard for BSL projects. The CLA grants the project the right to relicense your contribution as part of the project, which the automatic BSL → Apache 2.0 conversion depends on.

Q: Can I report security issues?
Yes. See SECURITY.md. Reports are acknowledged within 48 hours, with a 30-day fix SLA for HIGH/CRITICAL issues.

§8. Troubleshooting

Common issues

docker compose up fails: port already in use
Another service is using port 8080, 5432, 9092, 8081, or 3000. Stop it or override the host port:

# docker-compose.override.yml (auto-loaded)
services:
  ledger:
    ports: ["18080:8080"]
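
To find out which port is the culprit before editing the override file, a small check like the one below can help. It is a sketch using only the standard library: the port list is the one named above, the function names are hypothetical, and it only detects listeners reachable on 127.0.0.1.

```python
import socket

STACK_PORTS = [8080, 5432, 9092, 8081, 3000]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=STACK_PORTS) -> list:
    """Return the subset of ports that already have a listener."""
    return [port for port in ports if port_in_use(port)]
```

Run `busy_ports()` before `docker compose up`; any port it returns needs either the conflicting process stopped or a host-port override like the one above.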

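Ledger service won't start: "Liquibase lock"
A previous startup left a stale lock. Make sure no other instance is still running a migration, then clear it:

docker compose exec postgres psql -U fincore -c "UPDATE databasechangeloglock SET locked = FALSE, lockgranted = NULL, lockedby = NULL WHERE id = 1;"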

Keycloak auth: "invalid_client"
The default sandbox client secret is demo-secret. Production needs its own secret.

Kafka producer: "TimeoutException"
Redpanda is not yet healthy. Wait 30 seconds. If the error persists, check docker compose logs redpanda for errors.

Test fails: "Container failed to start"
Testcontainers can't pull images. Ensure Docker is running, then pull the images manually:

docker pull postgres:17-alpine
docker pull redpandadata/redpanda:v24.3.1

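./gradlew build: out of memory
Increase the Gradle heap. The build runs in the Gradle daemon, whose heap is controlled by org.gradle.jvmargs rather than by a bare -Xmx in GRADLE_OPTS:

export GRADLE_OPTS="-Dorg.gradle.jvmargs=-Xmx4g"
./gradlew build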

Hibernate: LazyInitializationException
A lazy collection was accessed outside a transaction. Use @EntityGraph or fetch eagerly. See Code-Rules§5.

docker compose down -v lost my data
Yes, -v removes volumes. Don't use it unless that is what you intend.

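MapStruct: "Mapper not generated"
MapStruct is a Java annotation processor, so in a Kotlin module it runs through kapt rather than KSP. Check that build.gradle.kts applies the kapt plugin and has kapt("org.mapstruct:mapstruct-processor:1.6.3"). Run ./gradlew kaptKotlin.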

Demo script fails: "jq not found"
Install jq: brew install jq / apt install jq.

Where to get help

Check this troubleshooting page
Search GitHub Discussions
Open a new Discussion (preferred) or Issue (for bugs)
For commercial inquiries: email per SECURITY.md
