This repository is a comprehensive, architecture-first learning package for designing OpenTelemetry-based logging and distributed event-log tracking at scale. It targets a realistic domain: a healthcare IoT platform managing ~1 million medical devices with a hybrid infrastructure spanning Kubernetes microservices and legacy on-premises systems.
This is not a toy example. Every section addresses real tradeoffs that arise at staff-engineer scale: cost, cardinality, compliance, reliability, and operational complexity.
- Staff / Principal Engineers designing observability platforms
- Platform / SRE teams building centralized logging pipelines
- Architects integrating legacy and modern systems under a single observability umbrella
- Engineers preparing for system design interviews on observability topics
- Teams in regulated industries (healthcare, fintech, government) with compliance requirements
A healthcare IoT platform with:
- ~1,000,000 medical devices in the field (patient monitors, infusion pumps, wearables, diagnostic equipment)
- Devices sending telemetry and events upstream via MQTT/HTTP/gRPC
- Backend platform services running in Kubernetes (ingestion, validation, enrichment, persistence, notification, analytics)
- Some modern microservices (Go, Java, Python) with structured logging
- Some legacy services on VMs with filesystem-based plain-text logs
- Central observability requirements for operations, debugging, compliance, and incident response
- Strict PHI/PII handling under HIPAA and related regulations
This repository includes a working .NET 10 implementation that demonstrates the architecture in practice:
# Prerequisites: .NET 10 SDK, Docker, kubectl (minikube/kind/k3s)
# Option 1: Deploy everything to local K8s with Elasticsearch
./scripts/deploy.sh
# Option 2: Deploy with Datadog instead
./scripts/deploy.sh --with-datadog
# Test the pipeline
kubectl -n healthcare-iot port-forward svc/ingestion-service 8080:8080 &
./scripts/test-pipeline.sh
# View logs in Kibana
kubectl -n observability port-forward svc/kibana 5601:5601
# Open http://localhost:5601, create data view for "logs-*"
| Component | Technology | Purpose |
|---|---|---|
| Shared Logging Library | .NET 10, Serilog, OTel SDK | Structured logging contract, event envelope, correlation middleware |
| Ingestion Service | ASP.NET Core Minimal API | Receives device alerts, generates correlation_id, forwards downstream |
| Validation Service | ASP.NET Core Minimal API | Validates alerts, logs event journal, forwards to notification |
| Notification Service | ASP.NET Core Minimal API | Sends notifications, logs success/failure event journal entries |
| OTel Collector DaemonSet | OTel Collector Contrib | Tails pod logs, enriches with K8s metadata, exports to ES |
| Elasticsearch + Kibana | Elastic 8.17 | Local log storage and visualization |
| Datadog Agent | Datadog Agent 7 | Alternative backend via OTLP ingestion |
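The table notes that the Ingestion Service generates the correlation_id and forwards it downstream. As a rough, language-neutral illustration of that hand-off (the repository's actual implementation is the .NET CorrelationMiddleware, which is not reproduced here), a minimal Python sketch could look like the following. The header name `X-Correlation-ID`, the `corr-` plus 12 hex characters format, and both function names are assumptions made for this example.

```python
import uuid

# Hypothetical header name; the repository's CorrelationMiddleware may use a different one.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers: dict[str, str]) -> str:
    """Return the inbound correlation ID, or mint a new one at the edge of the system."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == CORRELATION_HEADER.lower() and value.strip():
            return value.strip()
    # Assumed ID format: "corr-" followed by 12 hex characters.
    return f"corr-{uuid.uuid4().hex[:12]}"


def outbound_headers(correlation_id: str) -> dict[str, str]:
    """Headers to attach when calling the next service in the chain."""
    return {CORRELATION_HEADER: correlation_id}


# Ingestion: no inbound ID, so one is generated.
ingestion_id = ensure_correlation_id({})
# Validation: receives the ID that ingestion forwarded and reuses it unchanged.
validation_id = ensure_correlation_id(outbound_headers(ingestion_id))
assert validation_id == ingestion_id
```

Only the first hop mints an ID; every later hop reuses what it received, which is what lets a single query on correlation_id return the full journey of one device alert.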
Device Alert → Ingestion Service → Validation Service → Notification Service
                     │                      │                      │
                     └── stdout JSON ───────┴── stdout JSON ───────┘
                                            │
                                OTel Collector DaemonSet
                                (filelog + k8sattributes)
                                            │
                                 ┌──────────┴──────────┐
                                 │                     │
                           Elasticsearch      Datadog (optional)
                                 │
                              Kibana
Each service emits structured JSON to stdout, which the OTel Collector DaemonSet collects:
{
"@t": "2024-01-15T10:30:45.123Z",
"@l": "Information",
"service.name": "ingestion-service",
"service.version": "1.0.0",
"environment": "production",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"correlation_id": "corr-abc123def456",
"tenant_id": "HOSP-789",
"device_id": "DEV-PM100-0042",
"EventType": "device.alert.received",
"Outcome": "success",
"@mt": "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}"
}
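To make the "one JSON object per line on stdout" contract concrete, here is a minimal sketch using only the Python standard library. It is not the repository's Serilog-based .NET code. The class name `JsonLineFormatter`, the `build_logger` helper, the `extra={"fields": ...}` convention, and the `message` key are assumptions made for illustration. Python level names (`INFO`) also differ from the Serilog-style `Information` shown in the sample above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def __init__(self, service_name: str, service_version: str):
        super().__init__()
        self.static_fields = {
            "service.name": service_name,
            "service.version": service_version,
        }

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "@t": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "@l": record.levelname,
            **self.static_fields,
            # Per-event fields passed via extra={"fields": {...}} (a convention assumed here).
            **getattr(record, "fields", {}),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service_name: str, service_version: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stdout by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter(service_name, service_version))
    logger = logging.getLogger(service_name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("ingestion-service", "1.0.0")
logger.info(
    "device alert received",
    extra={"fields": {
        "correlation_id": "corr-abc123def456",
        "tenant_id": "HOSP-789",
        "device_id": "DEV-PM100-0042",
        "EventType": "device.alert.received",
        "Outcome": "success",
    }},
)
```

Writing to stdout rather than to a file or directly to a backend keeps the service unaware of where logs end up; the OTel Collector DaemonSet can then tail the container log files and parse each line as JSON.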
├── README.md # This file
├── global.json # .NET SDK version
├── HealthcareIoT.sln # .NET solution file
├── src/
│ ├── HealthcareIoT.Logging.Shared/ # Shared structured logging library
│ │ ├── ObservabilityExtensions.cs # One-call OTel + Serilog + ES/DD setup
│ │ ├── CorrelationMiddleware.cs # HTTP correlation header propagation
│ │ ├── CorrelationContext.cs # Business correlation context
│ │ ├── EventEnvelope.cs # Distributed event log schema
│ │ ├── EventLogger.cs # Event journal structured logger
│ │ ├── LoggingConstants.cs # Standardized field names
│ │ └── DeviceAlertRequest.cs # Shared DTOs
│ ├── HealthcareIoT.Ingestion/ # Ingestion microservice
│ ├── HealthcareIoT.Validation/ # Validation microservice
│ └── HealthcareIoT.Notification/ # Notification microservice
├── k8s/
│ ├── base/namespace.yaml # K8s namespaces
│ ├── otel/otel-collector-daemonset.yaml # OTel Collector DaemonSet + RBAC
│ ├── elasticsearch/elasticsearch.yaml # ES + Kibana StatefulSet
│ ├── datadog/datadog-agent.yaml # Datadog Agent DaemonSet
│ └── services/ # Service deployments
├── scripts/
│ ├── deploy.sh # One-command deployment
│ └── test-pipeline.sh # End-to-end test
├── configs/ # Reference OTel/legacy/ELK configs
├── docs/
│ ├── observability-overview.md # High-level architecture and philosophy
│ ├── opentelemetry-logging-foundations.md # OTel logging concepts deep dive
│ ├── distributed-event-logging.md # Distributed event-log tracking across services
│ ├── structured-logging-strategy.md # Structured logging contract and schema design
│ ├── kubernetes-pod-log-collection.md # K8s pod log collection patterns
│ ├── legacy-filesystem-log-collection.md # Legacy file-based log collection
│ ├── legacy-log-transformation.md # Transforming legacy logs to structured format
│ ├── elk-architecture.md # ELK pipeline architecture
│ ├── datadog-architecture.md # Datadog pipeline architecture
│ ├── hybrid-modern-and-legacy-logging.md # Unified hybrid observability architecture
│ ├── log-correlation-strategy.md # Correlation IDs, trace context, troubleshooting
│ ├── medical-device-platform-considerations.md # Healthcare IoT domain specifics
│ ├── security-and-compliance-for-logs.md # PHI/PII, HIPAA, encryption, access control
│ ├── cost-cardinality-and-retention.md # Cost management, cardinality, retention tiers
│ ├── deployment-patterns-for-collectors.md # DaemonSet vs sidecar vs gateway patterns
│ ├── incident-debugging-playbook.md # Incident response with logs
│ ├── elk-vs-datadog.md # Detailed comparison
│ └── staff-level-cheatsheet.md # Quick-reference cheat sheets
├── configs/
│ ├── otel/
│ │ ├── collector-k8s-daemonset.yaml # OTel Collector config for K8s pod logs
│ │ ├── collector-gateway.yaml # Gateway collector config
│ │ └── collector-legacy-filelog.yaml # Filelog receiver for legacy systems
│ ├── legacy/
│ │ ├── filebeat-legacy.yaml # Filebeat config for legacy tailing
│ │ └── logstash-transform-pipeline.conf # Logstash transformation pipeline
│ ├── elk/
│ │ └── elasticsearch-ilm-policy.json # Index lifecycle management
│ └── datadog/
│ └── datadog-agent-otel.yaml # Datadog agent with OTLP ingestion
└── diagrams/
└── architecture-diagrams.md # All Mermaid diagrams in one reference file
- Start with docs/observability-overview.md for the big picture
- Deep dive into specific topics based on your interest
- Study the diagrams in diagrams/architecture-diagrams.md and inline in each doc
- Review configs in configs/ to see realistic collector/pipeline configurations
- Use the cheat sheets in docs/staff-level-cheatsheet.md for revision
- Follow the incident playbook in docs/incident-debugging-playbook.md for practical troubleshooting patterns
| Order | Document | Why |
|---|---|---|
| 1 | observability-overview.md | Understand the full architecture |
| 2 | opentelemetry-logging-foundations.md | Understand OTel logging primitives |
| 3 | distributed-event-logging.md | Understand event tracking across services |
| 4 | structured-logging-strategy.md | Understand the logging contract |
| 5 | kubernetes-pod-log-collection.md | Understand modern log collection |
| 6 | legacy-filesystem-log-collection.md | Understand legacy log collection |
| 7 | legacy-log-transformation.md | Understand transformation pipelines |
| 8 | log-correlation-strategy.md | Understand correlation and debugging |
| 9 | elk-architecture.md | Understand ELK pipeline |
| 10 | datadog-architecture.md | Understand Datadog pipeline |
| 11 | hybrid-modern-and-legacy-logging.md | Understand the unified architecture |
| 12 | deployment-patterns-for-collectors.md | Understand deployment tradeoffs |
| 13 | medical-device-platform-considerations.md | Understand domain specifics |
| 14 | security-and-compliance-for-logs.md | Understand compliance requirements |
| 15 | cost-cardinality-and-retention.md | Understand cost and scale |
| 16 | elk-vs-datadog.md | Compare backends |
| 17 | incident-debugging-playbook.md | Practice troubleshooting |
| 18 | staff-level-cheatsheet.md | Quick revision |
graph TB
subgraph "IoT Device Fleet (~1M devices)"
D1[Patient Monitors]
D2[Infusion Pumps]
D3[Wearable Sensors]
D4[Diagnostic Equipment]
end
subgraph "Ingestion Layer"
MQTT[MQTT Broker]
HTTPGW[HTTP/gRPC Gateway]
end
subgraph "Kubernetes Platform"
SVC1[Ingestion Service]
SVC2[Validation Service]
SVC3[Enrichment Service]
SVC4[Persistence Service]
SVC5[Notification Service]
SVC6[Analytics Pipeline]
OTEL_DS[OTel Collector DaemonSet]
OTEL_GW[OTel Collector Gateway]
end
subgraph "Legacy Systems"
VM1[Legacy App Server VM]
VM2[Legacy Database Server]
VM3[On-Prem File Logger]
AGENT[File Tailing Agent]
TRANSFORM[Transform Pipeline]
end
subgraph "Observability Backends"
ELK[ELK Stack]
DD[Datadog]
end
subgraph "Consumers"
SRE[SRE Team]
DEV[Engineering]
SEC[Security/Compliance]
SUP[Support]
end
D1 & D2 & D3 & D4 --> MQTT & HTTPGW
MQTT & HTTPGW --> SVC1
SVC1 --> SVC2 --> SVC3 --> SVC4
SVC4 --> SVC5 & SVC6
SVC1 & SVC2 & SVC3 & SVC4 & SVC5 & SVC6 -->|stdout/stderr| OTEL_DS
OTEL_DS -->|OTLP| OTEL_GW
OTEL_GW -->|export| ELK & DD
VM1 & VM2 & VM3 -->|file logs| AGENT
AGENT --> TRANSFORM
TRANSFORM -->|OTLP| OTEL_GW
ELK & DD --> SRE & DEV & SEC & SUP
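The legacy path in the diagram (file logs → tailing agent → transform pipeline → gateway) depends on turning plain-text lines into the same structured shape the modern services emit. The sketch below shows one way that step might look. The legacy line format, the regex, the `component` and `parse_error` field names, and the assumption that legacy hosts write timestamps in UTC are all hypothetical; the repository's own transformation lives in its Logstash and OTel Collector configs.

```python
import re
from datetime import datetime, timezone

# Hypothetical legacy format: "2024-01-15 10:30:45,123 ERROR [PumpGateway] message text"
LEGACY_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<component>[^\]]+)\] (?P<message>.*)$"
)


def transform_legacy_line(line: str, service_name: str) -> dict:
    """Convert one plain-text legacy log line into a structured record."""
    text = line.rstrip("\n")
    match = LEGACY_LINE.match(text)
    if match is None:
        # Keep unparseable lines instead of dropping them, and flag them for review.
        return {"service.name": service_name, "message": text, "parse_error": True}
    # Assumes the legacy host writes timestamps in UTC.
    parsed = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "@t": parsed.strftime("%Y-%m-%dT%H:%M:%S") + f".{match['ms']}Z",
        "@l": match["level"],
        "service.name": service_name,
        "component": match["component"],
        "message": match["message"],
    }


record = transform_legacy_line(
    "2024-01-15 10:30:45,123 ERROR [PumpGateway] timeout talking to device",
    service_name="legacy-app-server",
)
print(record["@t"], record["@l"], record["component"])
```

Flagging lines that fail to parse, rather than discarding them, preserves evidence that may matter during an incident or audit and makes gaps in the pattern visible.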
- Logs are a first-class signal — not an afterthought bolted onto metrics and traces
- Structured logging is non-negotiable at scale — unstructured text cannot be queried reliably across 1M devices
- Distributed event logs are distinct from diagnostic logs — they serve different consumers and have different retention/immutability requirements
- Legacy systems must be integrated, not ignored — transformation pipelines bridge the gap
- Correlation is the superpower — trace IDs, correlation IDs, and device event IDs make logs useful during incidents
- Cost is an architecture concern — at 1M devices, every field, every log level, every retention day has a dollar cost
- Compliance is not optional — PHI/PII redaction, audit trails, and access control are first-order design constraints in healthcare
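The cost principle above can be made tangible with back-of-envelope arithmetic. Apart from the fleet size, every number in this sketch (events per device, log lines per event, bytes per line, retention) is a hypothetical placeholder, not a measurement from the repository. The result is raw payload only and ignores compression, replication, and index overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogVolumeAssumptions:
    devices: int = 1_000_000               # fleet size from the scenario
    events_per_device_per_day: int = 100   # hypothetical
    log_lines_per_event: int = 3           # hypothetical: one journal entry per service hop
    bytes_per_log_line: int = 600          # hypothetical
    retention_days: int = 30               # hypothetical


def daily_bytes(a: LogVolumeAssumptions) -> int:
    """Raw log bytes produced per day across the whole fleet."""
    return (a.devices * a.events_per_device_per_day
            * a.log_lines_per_event * a.bytes_per_log_line)


def retained_bytes(a: LogVolumeAssumptions) -> int:
    """Raw log bytes held at steady state for the retention window."""
    return daily_bytes(a) * a.retention_days


def daily_bytes_for_extra_field(a: LogVolumeAssumptions, field_bytes: int) -> int:
    """Additional bytes per day if every log line gains one field of the given size."""
    return (a.devices * a.events_per_device_per_day
            * a.log_lines_per_event * field_bytes)


assumptions = LogVolumeAssumptions()
print(f"daily volume:    {daily_bytes(assumptions) / 1e9:.1f} GB")
print(f"retained volume: {retained_bytes(assumptions) / 1e12:.1f} TB")
print(f"one 40-byte field adds {daily_bytes_for_extra_field(assumptions, 40) / 1e9:.1f} GB/day")
```

Under these placeholder numbers the fleet produces 180 GB of raw logs per day and holds 5.4 TB at 30 days of retention, and a single extra 40-byte field adds 12 GB per day. The absolute values matter less than the multiplication: each per-line decision is scaled by devices × events × hops.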
This is an educational learning package. Use freely for learning, training, and internal architecture discussions.