Fix docker compose and Grafana observability #6

Merged: vladiant merged 6 commits into main from fix-docker-compose on Mar 14, 2026

@vladiant
Collaborator

Summary

This pull request fixes Docker Compose startup issues and improves the accuracy of Grafana observability by:

  1. Fixing the migrate-indexes container startup failure.
  2. Expanding p95 dashboard coverage with related histogram panels.
  3. Fixing Loki alert-rule loading errors in Grafana.
  4. Enforcing millisecond display units for latency p95 and histogram panels.
  5. Improving histogram precision by configuring explicit sub-second bucket boundaries in OpenTelemetry.

Scope of Changes

Changed files:

  • Dockerfile
  • app/observability/setup.py
  • observability/grafana/dashboards/commerce-observability.json
  • observability/grafana/dashboards/commerce-observability-p2.json
  • observability/grafana/dashboards/commerce-observability-p3.json
  • observability/grafana/provisioning/datasources/datasources.yml

Diff summary:

  • 6 files changed
  • 202 insertions, 1 deletion

What Was Fixed

1. Docker Compose startup failure (migrate-indexes)

  • Root cause: scripts/migrate_indexes.py was not present in the runtime image.
  • Fix: added COPY scripts/ ./scripts/ in Dockerfile.
  • Result: docker compose up -d --build now completes with migrate-indexes succeeding.

2. Missing histogram context for p95 panels

  • Added related histogram/heatmap panels for p95 charts across P1/P2/P3 dashboards.
  • This provides direct bucket distribution context next to p95 quantiles.

3. Grafana alerting error with Loki datasource

  • Error seen: Unable to fetch alert rules ... Is the Loki data source properly configured?
  • Root cause: Grafana tried to fetch Loki-managed alert rules, but the Loki instance has no ruler support.
  • Fix: set manageAlerts: false for Loki datasource in datasources.yml.

4. Unit consistency for latency panels

  • Added explicit fieldConfig.defaults.unit: "ms" to all p95 panels labeled "(ms)".
  • Updated duration histogram panel titles and units to display in milliseconds for clarity.
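
One way to confirm the unit change across the three dashboards is to walk each dashboard's panels and list the p95 panels whose default unit is not "ms". The following is a minimal sketch, not code from this PR. It assumes the standard Grafana dashboard JSON layout (a top-level `panels` array, with row panels nesting their own `panels`) and assumes the affected panels can be recognized by "p95" in the title. The helper names and the inline sample dashboard are hypothetical.

```python
import json


def iter_panels(panels):
    """Yield every panel, descending into row panels that nest their own panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def p95_panels_missing_ms(dashboard):
    """Return titles of p95 panels whose fieldConfig.defaults.unit is not 'ms'."""
    missing = []
    for panel in iter_panels(dashboard.get("panels", [])):
        title = panel.get("title", "")
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if "p95" in title.lower() and unit != "ms":
            missing.append(title)
    return missing


# Hypothetical sample: one correct panel, one nested panel with no unit set.
sample = json.loads("""
{
  "panels": [
    {"title": "HTTP request p95 (ms)",
     "fieldConfig": {"defaults": {"unit": "ms"}}},
    {"title": "Latency", "type": "row", "panels": [
      {"title": "DB query p95 (ms)", "fieldConfig": {"defaults": {}}}
    ]}
  ]
}
""")

print(p95_panels_missing_ms(sample))
```

To run this against the real files, load each dashboard listed under "Scope of Changes" with `json.load` and pass the result to `p95_panels_missing_ms`; an empty list means every matching panel declares the unit.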

5. Histogram quantile precision issue

  • Root cause: coarse default histogram buckets caused p95 interpolation artifacts (for example, values of about 4.75 s).
  • Fix: configured explicit sub-second histogram bucket boundaries via OTel metric Views in app/observability/setup.py.
  • Applied to:
    • commerce_http_request_duration_seconds
    • commerce_http_response_time_seconds
    • commerce_http_processing_duration_seconds
    • commerce_http_queue_wait_duration_seconds
    • commerce_db_query_duration_seconds
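
The 4.75 s figure is consistent with linear interpolation inside a single wide bucket. The sketch below illustrates this under two assumptions that this PR does not state: that the OpenTelemetry SDK default boundaries (0, 5, 10, 25, ...) were in effect before the change, and that p95 is computed the way Prometheus `histogram_quantile` does it. If every request finishes in under 5 s, all observations fall into the (0, 5] bucket and p95 interpolates to 0.95 × 5 = 4.75, whatever the real latencies are. The function names, the sample latencies, and the fine-grained boundary list are illustrative only.

```python
from bisect import bisect_left


def cumulative_counts(observations, boundaries):
    """Cumulative count per upper bound (le), followed by a final +Inf bucket."""
    counts = [0] * (len(boundaries) + 1)
    for value in observations:
        # First boundary >= value, so a value equal to a bound lands in that bucket.
        counts[bisect_left(boundaries, value)] += 1
    running, cumulative = 0, []
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative


def histogram_quantile(q, boundaries, cumulative):
    """Simplified Prometheus-style quantile: linear interpolation within a bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    index = next(i for i, count in enumerate(cumulative) if count >= rank)
    if index == len(boundaries):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return boundaries[-1]
    upper = boundaries[index]
    lower = boundaries[index - 1] if index > 0 else 0.0
    prev = cumulative[index - 1] if index > 0 else 0
    count = cumulative[index]
    if count == prev:
        return upper
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Assumed pre-fix boundaries: the OpenTelemetry SDK defaults.
COARSE = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
# Assumed sub-second boundaries, for illustration only.
FINE = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Made-up latencies in seconds: 90 requests at 20 ms, 10 requests at 80 ms.
latencies = [0.02] * 90 + [0.08] * 10

coarse_p95 = histogram_quantile(0.95, COARSE, cumulative_counts(latencies, COARSE))
fine_p95 = histogram_quantile(0.95, FINE, cumulative_counts(latencies, FINE))
print(coarse_p95, fine_p95)
```

With the coarse boundaries the estimate comes out at about 4.75 s even though no request took longer than 80 ms. With the finer boundaries it falls between 0.05 s and 0.1 s, the bucket that actually contains the 95th percentile.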

Validation Performed

  • docker compose build completed successfully.
  • docker compose up -d --build succeeded after Dockerfile fix.
  • docker compose ps confirmed expected services up.
  • docker compose logs migrate-indexes confirmed successful migration execution.
  • Grafana logs checked after restart: no recurring Loki rule-fetch errors.
  • Dashboard JSON files validated with jq empty.
  • /metrics verified to expose new explicit histogram bucket boundaries (0.005, 0.01, 0.025, ...).
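
The /metrics check in the last item can be scripted. The sketch below is a minimal example, not code from this PR. It assumes the endpoint serves the Prometheus text exposition format, with histogram series using the usual `_bucket` suffix and `le` label. The sample payload, including the `http_route` label, is made up for illustration and was not captured from this service.

```python
import re

# Matches "<name>_bucket{<labels>} <value>" lines of the text exposition format.
BUCKET_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>[^}]*)\}\s'
)
# Anchored so that labels merely ending in "le" (such as "handle") do not match.
LE_LABEL = re.compile(r'(?:^|,)\s*le="([^"]+)"')


def bucket_boundaries(metrics_text, metric_name):
    """Return the sorted, de-duplicated le bounds exposed for one histogram."""
    bounds = set()
    for line in metrics_text.splitlines():
        match = BUCKET_LINE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        le = LE_LABEL.search(match.group("labels"))
        if le:
            bounds.add(float(le.group(1)))  # float() also parses "+Inf"
    return sorted(bounds)


# Hypothetical payload for illustration.
SAMPLE = """\
# TYPE commerce_http_request_duration_seconds histogram
commerce_http_request_duration_seconds_bucket{http_route="/cart",le="0.005"} 3
commerce_http_request_duration_seconds_bucket{http_route="/cart",le="0.01"} 7
commerce_http_request_duration_seconds_bucket{http_route="/cart",le="0.025"} 12
commerce_http_request_duration_seconds_bucket{http_route="/cart",le="+Inf"} 12
commerce_http_request_duration_seconds_count{http_route="/cart"} 12
"""

bounds = bucket_boundaries(SAMPLE, "commerce_http_request_duration_seconds")
print(bounds)
print("has sub-second buckets:", any(b < 1 for b in bounds))
```

Against a running stack, the text would come from the service's /metrics endpoint instead of `SAMPLE`. The same call can then be repeated for each of the five histograms listed under "Histogram quantile precision issue".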

Notes for Reviewers

  • p95 values may take up to one full query window ([5m]) to fully reflect new bucket behavior after deployment.
  • Restart/rebuild is required for bucket/view configuration changes to take effect.
  • Grafana may require refresh/restart to immediately reflect provisioning updates.

@vladiant vladiant merged commit 2eae299 into main Mar 14, 2026
1 check passed
@vladiant vladiant deleted the fix-docker-compose branch March 14, 2026 19:43