senapatisantosh/DistributedLog

OpenTelemetry-Based Logging & Distributed Event-Log Tracking

A Staff-Level Architecture Learning Package for Healthcare IoT Observability


What This Repository Teaches

This repository is a comprehensive, architecture-first learning package for designing OpenTelemetry-based logging and distributed event-log tracking at scale. It targets a realistic domain: a healthcare IoT platform managing ~1 million medical devices with a hybrid infrastructure spanning Kubernetes microservices and legacy on-premises systems.

This is not a toy example. Every section addresses real tradeoffs that arise at staff-engineer scale: cost, cardinality, compliance, reliability, and operational complexity.


Who This Is For

  • Staff / Principal Engineers designing observability platforms
  • Platform / SRE teams building centralized logging pipelines
  • Architects integrating legacy and modern systems under a single observability umbrella
  • Engineers preparing for system design interviews on observability topics
  • Teams in regulated industries (healthcare, fintech, government) with compliance requirements

Domain Context

A healthcare IoT platform with:

  • ~1,000,000 medical devices in the field (patient monitors, infusion pumps, wearables, diagnostic equipment)
  • Devices sending telemetry and events upstream via MQTT/HTTP/gRPC
  • Backend platform services running in Kubernetes (ingestion, validation, enrichment, persistence, notification, analytics)
  • Some modern microservices (Go, Java, Python) with structured logging
  • Some legacy services on VMs with filesystem-based plain-text logs
  • Central observability requirements for operations, debugging, compliance, and incident response
  • Strict PHI/PII handling under HIPAA and related regulations


.NET 10 Implementation

This repository includes a working .NET 10 implementation that demonstrates the architecture in practice.

Quick Start

# Prerequisites: .NET 10 SDK, Docker, kubectl (minikube/kind/k3s)

# Option 1: Deploy everything to local K8s with Elasticsearch
./scripts/deploy.sh

# Option 2: Deploy with Datadog instead
./scripts/deploy.sh --with-datadog

# Test the pipeline
kubectl -n healthcare-iot port-forward svc/ingestion-service 8080:8080 &
./scripts/test-pipeline.sh

# View logs in Kibana
kubectl -n observability port-forward svc/kibana 5601:5601
# Open http://localhost:5601, create data view for "logs-*"

What the Implementation Includes

| Component | Technology | Purpose |
|---|---|---|
| Shared Logging Library | .NET 10, Serilog, OTel SDK | Structured logging contract, event envelope, correlation middleware |
| Ingestion Service | ASP.NET Core Minimal API | Receives device alerts, generates correlation_id, forwards downstream |
| Validation Service | ASP.NET Core Minimal API | Validates alerts, logs event journal, forwards to notification |
| Notification Service | ASP.NET Core Minimal API | Sends notifications, logs success/failure event journal entries |
| OTel Collector DaemonSet | OTel Collector Contrib | Tails pod logs, enriches with K8s metadata, exports to ES |
| Elasticsearch + Kibana | Elastic 8.17 | Local log storage and visualization |
| Datadog Agent | Datadog Agent 7 | Alternative backend via OTLP ingestion |
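
The table says the Ingestion Service generates a correlation_id and that the shared library propagates it between services. The following is a minimal, language-neutral sketch of that idea in Python, not the repository's C# CorrelationMiddleware. The header name `X-Correlation-ID` and the helper names are assumptions made for illustration; the `corr-` prefix follows the log example in this README.

```python
import uuid
from typing import Optional

# Assumed header name; the repository's middleware may use a different one.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(incoming_headers: dict) -> str:
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == CORRELATION_HEADER.lower() and value:
            return value
    return "corr-" + uuid.uuid4().hex[:12]


def outgoing_headers(correlation_id: str, extra: Optional[dict] = None) -> dict:
    """Build headers for the downstream call so the next hop sees the same ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


# Ingestion: the device request carries no ID, so one is generated.
ingestion_id = ensure_correlation_id({"Content-Type": "application/json"})
# Validation: receives the forwarded header and reuses it unchanged.
validation_id = ensure_correlation_id(outgoing_headers(ingestion_id))
```

Only the first service in the chain mints an ID; every later hop reuses what it receives, which is what lets one query on correlation_id return the whole journey of an alert.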

Architecture (Implementation)

Device Alert → Ingestion Service → Validation Service → Notification Service
                    │                      │                      │
                    └── stdout JSON ───────┴── stdout JSON ───────┘
                              │
                    OTel Collector DaemonSet
                    (filelog + k8sattributes)
                              │
                   ┌──────────┴──────────┐
                   │                     │
            Elasticsearch           Datadog (optional)
                   │
                Kibana
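
To make the "stdout JSON → Collector" hop concrete: CRI-compliant container runtimes such as containerd write each stdout line to a file on the node in the form `<timestamp> <stream> <P|F> <message>`, and the collector tails those files. The sketch below mimics, in Python, the kind of parsing a log collector performs on such a line. It is an illustration only, not the OTel Collector's implementation, and the function name and record keys are hypothetical.

```python
import json


def parse_cri_line(line: str) -> dict:
    """Split one CRI-format log line and decode the JSON payload if present."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 4:
        raise ValueError("not a CRI-format log line")
    timestamp, stream, tag, message = parts
    record = {
        "time": timestamp,
        "stream": stream,          # "stdout" or "stderr"
        "partial": tag == "P",     # "P" marks a fragment of a longer line
        "body": message,
    }
    try:
        parsed = json.loads(message)
    except json.JSONDecodeError:
        parsed = None              # plain-text line: keep the raw body only
    if isinstance(parsed, dict):
        record["attributes"] = parsed
    return record


sample = (
    '2024-01-15T10:30:45.123456789Z stdout F '
    '{"@l": "Information", "service.name": "ingestion-service", '
    '"correlation_id": "corr-abc123def456"}'
)
record = parse_cri_line(sample)
print(record["stream"], record["attributes"]["correlation_id"])
# prints: stdout corr-abc123def456
```

Pod metadata such as namespace and pod name is not part of the line itself; in the architecture above it is added afterwards by the k8sattributes step.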

Log Output Example

Each service emits structured JSON to stdout, which the OTel Collector DaemonSet collects:

{
  "@t": "2024-01-15T10:30:45.123Z",
  "@l": "Information",
  "service.name": "ingestion-service",
  "service.version": "1.0.0",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "corr-abc123def456",
  "tenant_id": "HOSP-789",
  "device_id": "DEV-PM100-0042",
  "EventType": "device.alert.received",
  "Outcome": "success",
  "@mt": "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}"
}
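
As a rough illustration of how a service could produce an entry like the one above, here is a minimal Python sketch that renders one event-journal record as a single JSON line. It is not the repository's EventLogger (which is built on Serilog); the function name and parameters are hypothetical, and only a subset of the fields from the example is included.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Optional


def event_journal_line(event_type: str, outcome: str, correlation_id: str,
                       tenant_id: str, device_id: str,
                       service_name: str = "ingestion-service",
                       now: Optional[datetime] = None) -> str:
    """Render one event-journal entry as a single-line JSON string.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = {
        "@t": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "@l": "Information",
        "service.name": service_name,
        "correlation_id": correlation_id,
        "tenant_id": tenant_id,
        "device_id": device_id,
        "EventType": event_type,
        "Outcome": outcome,
    }
    # Compact separators and no indentation: the whole event stays on one line.
    return json.dumps(entry, separators=(",", ":"))


line = event_journal_line(
    "device.alert.received", "success",
    correlation_id="corr-abc123def456",
    tenant_id="HOSP-789",
    device_id="DEV-PM100-0042",
)
sys.stdout.write(line + "\n")
```

Keeping each event on exactly one line matters because a line-oriented collector treats every line as a separate record; pretty-printed, multi-line JSON would be split into fragments.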

Repository Structure

.
├── README.md                                    # This file
├── global.json                                  # .NET SDK version
├── HealthcareIoT.sln                            # .NET solution file
├── src/
│   ├── HealthcareIoT.Logging.Shared/            # Shared structured logging library
│   │   ├── ObservabilityExtensions.cs            # One-call OTel + Serilog + ES/DD setup
│   │   ├── CorrelationMiddleware.cs              # HTTP correlation header propagation
│   │   ├── CorrelationContext.cs                 # Business correlation context
│   │   ├── EventEnvelope.cs                      # Distributed event log schema
│   │   ├── EventLogger.cs                        # Event journal structured logger
│   │   ├── LoggingConstants.cs                   # Standardized field names
│   │   └── DeviceAlertRequest.cs                 # Shared DTOs
│   ├── HealthcareIoT.Ingestion/                  # Ingestion microservice
│   ├── HealthcareIoT.Validation/                 # Validation microservice
│   └── HealthcareIoT.Notification/               # Notification microservice
├── k8s/
│   ├── base/namespace.yaml                       # K8s namespaces
│   ├── otel/otel-collector-daemonset.yaml        # OTel Collector DaemonSet + RBAC
│   ├── elasticsearch/elasticsearch.yaml          # ES + Kibana StatefulSet
│   ├── datadog/datadog-agent.yaml                # Datadog Agent DaemonSet
│   └── services/                                 # Service deployments
├── scripts/
│   ├── deploy.sh                                 # One-command deployment
│   └── test-pipeline.sh                          # End-to-end test
├── docs/
│   ├── observability-overview.md                # High-level architecture and philosophy
│   ├── opentelemetry-logging-foundations.md      # OTel logging concepts deep dive
│   ├── distributed-event-logging.md             # Distributed event-log tracking across services
│   ├── structured-logging-strategy.md           # Structured logging contract and schema design
│   ├── kubernetes-pod-log-collection.md         # K8s pod log collection patterns
│   ├── legacy-filesystem-log-collection.md      # Legacy file-based log collection
│   ├── legacy-log-transformation.md             # Transforming legacy logs to structured format
│   ├── elk-architecture.md                      # ELK pipeline architecture
│   ├── datadog-architecture.md                  # Datadog pipeline architecture
│   ├── hybrid-modern-and-legacy-logging.md      # Unified hybrid observability architecture
│   ├── log-correlation-strategy.md              # Correlation IDs, trace context, troubleshooting
│   ├── medical-device-platform-considerations.md # Healthcare IoT domain specifics
│   ├── security-and-compliance-for-logs.md      # PHI/PII, HIPAA, encryption, access control
│   ├── cost-cardinality-and-retention.md        # Cost management, cardinality, retention tiers
│   ├── deployment-patterns-for-collectors.md    # DaemonSet vs sidecar vs gateway patterns
│   ├── incident-debugging-playbook.md           # Incident response with logs
│   ├── elk-vs-datadog.md                        # Detailed comparison
│   └── staff-level-cheatsheet.md                # Quick-reference cheat sheets
├── configs/                                     # Reference OTel/legacy/ELK/Datadog configs
│   ├── otel/
│   │   ├── collector-k8s-daemonset.yaml         # OTel Collector config for K8s pod logs
│   │   ├── collector-gateway.yaml               # Gateway collector config
│   │   └── collector-legacy-filelog.yaml         # Filelog receiver for legacy systems
│   ├── legacy/
│   │   ├── filebeat-legacy.yaml                 # Filebeat config for legacy tailing
│   │   └── logstash-transform-pipeline.conf     # Logstash transformation pipeline
│   ├── elk/
│   │   └── elasticsearch-ilm-policy.json        # Index lifecycle management
│   └── datadog/
│       └── datadog-agent-otel.yaml              # Datadog agent with OTLP ingestion
└── diagrams/
    └── architecture-diagrams.md                 # All Mermaid diagrams in one reference file

How to Use This Repository

  1. Start with docs/observability-overview.md for the big picture
  2. Deep dive into specific topics based on your interest
  3. Study the diagrams in diagrams/architecture-diagrams.md and inline in each doc
  4. Review configs in configs/ to see realistic collector/pipeline configurations
  5. Use the cheat sheets in docs/staff-level-cheatsheet.md for revision
  6. Follow the incident playbook in docs/incident-debugging-playbook.md for practical troubleshooting patterns

Reading Order (Recommended)

| Order | Document | Why |
|---|---|---|
| 1 | observability-overview.md | Understand the full architecture |
| 2 | opentelemetry-logging-foundations.md | Understand OTel logging primitives |
| 3 | distributed-event-logging.md | Understand event tracking across services |
| 4 | structured-logging-strategy.md | Understand the logging contract |
| 5 | kubernetes-pod-log-collection.md | Understand modern log collection |
| 6 | legacy-filesystem-log-collection.md | Understand legacy log collection |
| 7 | legacy-log-transformation.md | Understand transformation pipelines |
| 8 | log-correlation-strategy.md | Understand correlation and debugging |
| 9 | elk-architecture.md | Understand ELK pipeline |
| 10 | datadog-architecture.md | Understand Datadog pipeline |
| 11 | hybrid-modern-and-legacy-logging.md | Understand the unified architecture |
| 12 | deployment-patterns-for-collectors.md | Understand deployment tradeoffs |
| 13 | medical-device-platform-considerations.md | Understand domain specifics |
| 14 | security-and-compliance-for-logs.md | Understand compliance requirements |
| 15 | cost-cardinality-and-retention.md | Understand cost and scale |
| 16 | elk-vs-datadog.md | Compare backends |
| 17 | incident-debugging-playbook.md | Practice troubleshooting |
| 18 | staff-level-cheatsheet.md | Quick revision |

High-Level Architecture (Quick View)

graph TB
    subgraph "IoT Device Fleet (~1M devices)"
        D1[Patient Monitors]
        D2[Infusion Pumps]
        D3[Wearable Sensors]
        D4[Diagnostic Equipment]
    end

    subgraph "Ingestion Layer"
        MQTT[MQTT Broker]
        HTTPGW[HTTP/gRPC Gateway]
    end

    subgraph "Kubernetes Platform"
        SVC1[Ingestion Service]
        SVC2[Validation Service]
        SVC3[Enrichment Service]
        SVC4[Persistence Service]
        SVC5[Notification Service]
        SVC6[Analytics Pipeline]
        OTEL_DS[OTel Collector DaemonSet]
        OTEL_GW[OTel Collector Gateway]
    end

    subgraph "Legacy Systems"
        VM1[Legacy App Server VM]
        VM2[Legacy Database Server]
        VM3[On-Prem File Logger]
        AGENT[File Tailing Agent]
        TRANSFORM[Transform Pipeline]
    end

    subgraph "Observability Backends"
        ELK[ELK Stack]
        DD[Datadog]
    end

    subgraph "Consumers"
        SRE[SRE Team]
        DEV[Engineering]
        SEC[Security/Compliance]
        SUP[Support]
    end

    D1 & D2 & D3 & D4 --> MQTT & HTTPGW
    MQTT & HTTPGW --> SVC1
    SVC1 --> SVC2 --> SVC3 --> SVC4
    SVC4 --> SVC5 & SVC6

    SVC1 & SVC2 & SVC3 & SVC4 & SVC5 & SVC6 -->|stdout/stderr| OTEL_DS
    OTEL_DS -->|OTLP| OTEL_GW
    OTEL_GW -->|export| ELK & DD

    VM1 & VM2 & VM3 -->|file logs| AGENT
    AGENT --> TRANSFORM
    TRANSFORM -->|OTLP| OTEL_GW

    ELK & DD --> SRE & DEV & SEC & SUP

Key Principles

  1. Logs are a first-class signal — not an afterthought bolted onto metrics and traces
  2. Structured logging is non-negotiable at scale — unstructured text cannot be queried reliably across 1M devices
  3. Distributed event logs are distinct from diagnostic logs — they serve different consumers and have different retention/immutability requirements
  4. Legacy systems must be integrated, not ignored — transformation pipelines bridge the gap
  5. Correlation is the superpower — trace IDs, correlation IDs, and device event IDs make logs useful during incidents
  6. Cost is an architecture concern — at 1M devices, every field, every log level, every retention day has a dollar cost
  7. Compliance is not optional — PHI/PII redaction, audit trails, and access control are first-order design constraints in healthcare
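
Principle 6 can be made tangible with back-of-envelope arithmetic. The sketch below estimates raw daily log volume for the fleet. Only the device count comes from this README; the event rate, lines per event, and bytes per line are placeholder assumptions chosen for illustration, not measurements from the repository.

```python
def daily_log_volume_gb(devices: int, events_per_device_per_day: int,
                        log_lines_per_event: int, bytes_per_line: int) -> float:
    """Raw (uncompressed, unindexed) log volume per day, in decimal gigabytes."""
    total_bytes = (devices * events_per_device_per_day
                   * log_lines_per_event * bytes_per_line)
    return total_bytes / 1e9


def retained_volume_gb(daily_gb: float, retention_days: int) -> float:
    """Volume held at steady state for a given retention window."""
    return daily_gb * retention_days


DEVICES = 1_000_000                # from the domain context in this README
EVENTS_PER_DEVICE_PER_DAY = 1_440  # ASSUMPTION: one event per device per minute
LOG_LINES_PER_EVENT = 3            # ASSUMPTION: one journal line per service hop
BYTES_PER_LINE = 500               # ASSUMPTION: average size of one JSON line

baseline_gb = daily_log_volume_gb(
    DEVICES, EVENTS_PER_DEVICE_PER_DAY, LOG_LINES_PER_EVENT, BYTES_PER_LINE)
# Cost of adding a single 40-byte field to every line.
extra_field_gb = daily_log_volume_gb(
    DEVICES, EVENTS_PER_DEVICE_PER_DAY, LOG_LINES_PER_EVENT, 40)

print(f"baseline:        {baseline_gb:.1f} GB/day")
print(f"one extra field: {extra_field_gb:.1f} GB/day")
print(f"30-day retained: {retained_volume_gb(baseline_gb, 30):.1f} GB")
```

Under these assumed inputs the baseline is 2,160 GB/day, and one extra 40-byte field adds 172.8 GB/day. The figures ignore compression, index overhead, replication, and sampling, so they show how the terms multiply rather than predict a real bill.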

License

This is an educational learning package. Use freely for learning, training, and internal architecture discussions.
