Data-Intensive Applications Learning Repository
A hands-on .NET 10 learning repository for understanding the core principles behind data-intensive applications. Inspired by Martin Kleppmann's Designing Data-Intensive Applications, this project provides runnable simulations, comparison benchmarks, and reference documentation covering replication, partitioning, database selection, file formats, and system design tradeoffs.
| Topic | Description |
| --- | --- |
| Replication | Leader-follower, multi-leader, and leaderless replication strategies; conflict resolution; consistency guarantees; failover |
| Partitioning & Sharding | Hash, range, consistent hashing, list, time-based, composite partitioning; rebalancing; hot spot mitigation |
| Database Selection | SQL vs Document vs Graph vs Vector vs Time-Series vs Key-Value vs Wide-Column; decision matrices; real-world scenarios |
| File Formats | CSV, JSON, XML, Avro, Parquet, ORC, Protobuf; row vs columnar storage; schema evolution |
| System Design Tradeoffs | Consistency vs availability, normalization vs denormalization, latency vs throughput; comprehensive decision guides |
graph TD
ROOT[DataIntensiveLearning]
ROOT --> DOCS[docs/]
ROOT --> SRC[src/]
ROOT --> TESTS[tests/]
ROOT --> DOCKER[docker/]
DOCS --> D1[replication/]
DOCS --> D2[partitioning/]
DOCS --> D3[database-selection/]
DOCS --> D4[file-formats/]
DOCS --> D5[system-design-scenarios/]
DOCS --> D6[decision-guides/]
SRC --> S1[Core — Shared models & interfaces]
SRC --> S2[Simulations — Runnable demos]
SRC --> S3[Partitioning — Strategy implementations]
SRC --> S4[FileFormats — Serialization examples]
SRC --> S5[DatabaseSelection — Advisor engine]
SRC --> S6[Api — ASP.NET Core scaffold]
TESTS --> T1[UnitTests — xUnit + FluentAssertions]
style ROOT fill:#2d3436,stroke:#dfe6e9,color:#dfe6e9
style DOCS fill:#0984e3,stroke:#dfe6e9,color:#fff
style SRC fill:#00b894,stroke:#dfe6e9,color:#fff
style TESTS fill:#fdcb6e,stroke:#2d3436,color:#2d3436
style DOCKER fill:#e17055,stroke:#dfe6e9,color:#fff
DistributedSystemWithBigData/
├── docs/
│ ├── replication/
│ │ ├── replication-overview.md # All replication strategies compared
│ │ ├── leader-follower.md # Single-leader deep dive
│ │ ├── multi-leader.md # Multi-leader + conflict resolution
│ │ ├── leaderless.md # Dynamo-style quorum replication
│ │ ├── consistency-and-lag.md # Replication lag + consistency models
│ │ └── failure-scenarios.md # Failover, split-brain, data loss
│ ├── partitioning/
│ │ ├── partitioning-overview.md # Partitioning vs sharding taxonomy
│ │ ├── hash-partitioning.md # Hash-based key distribution
│ │ ├── range-partitioning.md # Range-based for time-series + scans
│ │ ├── consistent-hashing.md # Hash ring + vnodes
│ │ ├── tenant-sharding.md # Multi-tenant isolation strategies
│ │ ├── hot-partitions-and-rebalancing.md # Skew detection + rebalancing
│ │ └── partition-key-selection.md # How to choose partition keys
│ ├── database-selection/
│ │ ├── sql-vs-nosql.md # Relational vs non-relational
│ │ ├── document-db-guide.md # MongoDB, Couchbase deep dive
│ │ ├── when-to-use-graph-db.md # Neo4j, Neptune use cases
│ │ ├── when-to-use-vector-db.md # Pinecone, pgvector, RAG
│ │ ├── when-to-use-timeseries-db.md # TimescaleDB, InfluxDB
│ │ ├── when-to-use-key-value-db.md # Redis, DynamoDB patterns
│ │ ├── when-to-use-wide-column-db.md # Cassandra, ScyllaDB
│ │ └── decision-matrix.md # Comprehensive comparison + flowcharts
│ ├── file-formats/
│ │ ├── csv.md # CSV deep dive
│ │ ├── json.md # JSON + JSON Lines
│ │ ├── xml.md # XML + XSD + SOAP
│ │ ├── avro.md # Avro + Schema Registry
│ │ ├── parquet.md # Columnar analytics format
│ │ ├── orc.md # Hive-native columnar format
│ │ ├── format-comparison-matrix.md # All formats compared
│ │ └── schema-evolution.md # Schema evolution patterns
│ ├── system-design-scenarios/
│ │ ├── ecommerce-platform.md # Orders, catalog, search, cache
│ │ ├── iot-platform.md # Sensor ingestion + time-series
│ │ ├── fraud-detection.md # Graph-based fraud ring detection
│ │ ├── observability-pipeline.md # Metrics, logs, traces
│ │ └── rag-search-platform.md # Vector DB + RAG architecture
│ ├── interview-revision-cheatsheet.md # Master cheat sheet for all topics
│ └── architecture-decision-flow.md # Decision flowcharts (Mermaid)
├── src/
│ ├── DataIntensiveLearning.Core/ # Shared domain models + abstractions
│ │ ├── Enums/ # DatabaseType, ReplicationType, etc.
│ │ ├── Interfaces/ # IPartitioner, IReplicationSimulator
│ │ └── Models/ # DataRecord, ReplicaNode, Partition
│ ├── DataIntensiveLearning.Simulations/ # Runnable console simulations
│ │ ├── Program.cs # Entry point (run with --scenario)
│ │ └── Replication/
│ │     ├── LeaderFollowerSimulator.cs # Sync/async replication + failover
│ │     ├── MultiLeaderSimulator.cs # Conflict detection + resolution
│ │     └── LeaderlessSimulator.cs # Quorum reads/writes + read repair
│ ├── DataIntensiveLearning.Partitioning/ # Partitioning strategy implementations
│ │ ├── HashPartitioner.cs # Hash-mod with distribution analysis
│ │ ├── RangePartitioner.cs # Boundary-based range partitioning
│ │ ├── ConsistentHashPartitioner.cs # Hash ring with virtual nodes
│ │ ├── ListPartitioner.cs # Explicit key-to-partition mapping
│ │ ├── TimeBasedPartitioner.cs # Time-bucketed partitioning
│ │ ├── CompositePartitioner.cs # Hash + time two-level partitioning
│ │ └── SkewAnalyzer.cs # Hot partition detection
│ ├── DataIntensiveLearning.FileFormats/ # File format examples
│ │ ├── CsvFormatExample.cs # CSV read/write
│ │ ├── JsonFormatExample.cs # JSON serialization
│ │ ├── XmlFormatExample.cs # XML serialization + namespaces
│ │ ├── AvroFormatExample.cs # Avro schema + evolution concepts
│ │ ├── ParquetFormatExample.cs # Columnar storage concepts
│ │ ├── OrcFormatExample.cs # ORC vs Parquet comparison
│ │ ├── ProtobufFormatExample.cs # Binary encoding + schema evolution
│ │ └── Formats/ # Reusable format implementations
│ ├── DataIntensiveLearning.DatabaseSelection/
│ │ ├── DatabaseSelectionAdvisor.cs # Rule-based recommendation engine
│ │ └── Models/ # WorkloadProfile, DatabaseRecommendation
│ └── DataIntensiveLearning.Api/ # ASP.NET Core API scaffold
├── tests/
│ └── DataIntensiveLearning.UnitTests/
│     ├── Partitioning/ # Hash, range, consistent hash, list, time, composite tests
│     ├── Replication/ # Quorum, leaderless simulator tests
│     ├── FileFormats/ # CSV, JSON serialization tests
│     └── DatabaseSelection/ # Advisor recommendation tests
├── docker/
│ └── Dockerfile.api # .NET 10 API container
├── docker-compose.yml # PostgreSQL, Redis, MongoDB, Neo4j, TimescaleDB
├── DataIntensiveLearning.sln # .NET solution file
└── README.md # This file
Path 1: Beginner (Foundations First)
graph LR
A[1. Replication] --> B[2. Partitioning]
B --> C[3. Database Selection]
C --> D[4. File Formats]
D --> E[5. System Design Scenarios]
style A fill:#0984e3,color:#fff
style B fill:#00b894,color:#fff
style C fill:#fdcb6e,color:#2d3436
style D fill:#e17055,color:#fff
style E fill:#6c5ce7,color:#fff
Replication — How data is copied across nodes and why consistency is hard.
Partitioning — How to split data across machines for horizontal scale.
Database Selection — Choosing the right database for each workload.
File Formats — How serialization and storage format choices affect performance.
System Design — Apply everything in realistic design exercises.
Path 2: Interview Prep (Fast Track)
Interview Cheat Sheet — One-page summaries and common questions.
Architecture Decision Flow — Decision flowcharts for databases, formats, partitioning.
Database Decision Matrix — "Choose X when Y" tables.
System Design Scenarios — Practice articulating design decisions.
Path 3: Deep Dive by Topic
Topic-by-Topic Navigation
| Document | What You'll Learn |
| --- | --- |
| replication-overview.md | All three topologies compared, sync vs async, when to use each |
| leader-follower.md | Single-leader write flow, read replicas, failover process |
| multi-leader.md | Multi-datacenter replication, conflict resolution (LWW, merge, CRDTs) |
| leaderless.md | Quorum reads/writes (W+R>N), read repair, anti-entropy, sloppy quorums |
| consistency-and-lag.md | Read-after-write, monotonic reads, bounded staleness, causal consistency |
| failure-scenarios.md | Leader failure, split-brain, network partitions, data loss scenarios |
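The quorum condition W+R>N listed for leaderless.md can be checked with a few lines of arithmetic. The sketch below is illustrative only: `QuorumsOverlap` is a hypothetical helper, not a function from this repository, and it assumes strict (non-sloppy) quorums over a fixed set of N replicas.

```csharp
using System;

// Hypothetical helper (not part of the repository): returns true when every
// read quorum of size r must share at least one node with every write quorum
// of size w, given n replicas in total.
bool QuorumsOverlap(int n, int w, int r)
{
    if (n <= 0 || w < 1 || r < 1 || w > n || r > n)
        throw new ArgumentOutOfRangeException(nameof(n), "Require 1 <= W, R <= N.");
    return w + r > n;
}

Console.WriteLine(QuorumsOverlap(3, 2, 2)); // True:  2 + 2 > 3
Console.WriteLine(QuorumsOverlap(3, 1, 1)); // False: 1 + 1 <= 3, a read may miss the latest write
```

When the check returns false, a read can land entirely on replicas that have not yet seen the most recent write, which is where read repair and anti-entropy become the main sources of convergence.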
| Document | What You'll Learn |
| --- | --- |
| partitioning-overview.md | Partitioning vs sharding, horizontal vs vertical, OLTP vs OLAP |
| hash-partitioning.md | Hash functions, distribution analysis, the rehashing problem |
| range-partitioning.md | Boundary selection, time-series advantages, hot spot risks |
| consistent-hashing.md | Hash ring, virtual nodes, minimal redistribution on scaling |
| tenant-sharding.md | Multi-tenant isolation levels, geo-sharding, GDPR compliance |
| hot-partitions-and-rebalancing.md | Skew detection, key salting, rebalancing strategies |
| partition-key-selection.md | How to choose keys, cardinality, composite keys, common mistakes |
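The rehashing problem covered in hash-partitioning.md can be seen by counting how many keys change partition when a hash-mod scheme grows by one node. This is a rough sketch under stated assumptions: `Fnv1a` and `FractionMoved` are hypothetical names, the hash is an FNV-1a-style function over UTF-16 code units (chosen only because `string.GetHashCode()` is randomized per process in .NET), and the `user-{i}` keys are made up for the demonstration.

```csharp
using System;
using System.Linq;

// FNV-1a-style hash over UTF-16 code units; stable across runs.
uint Fnv1a(string key)
{
    uint hash = 2166136261;
    foreach (char c in key)
    {
        hash ^= c;
        hash *= 16777619; // wraps on overflow in the default unchecked context
    }
    return hash;
}

// Fraction of synthetic keys whose partition (hash mod node count) changes
// when the cluster goes from oldNodes to newNodes.
double FractionMoved(int keyCount, int oldNodes, int newNodes)
{
    int moved = Enumerable.Range(0, keyCount)
        .Select(i => Fnv1a($"user-{i}"))
        .Count(h => h % (uint)oldNodes != h % (uint)newNodes);
    return (double)moved / keyCount;
}

Console.WriteLine($"4 -> 5 nodes: {FractionMoved(10_000, 4, 5):P0} of keys move");
```

With a well-mixed hash, a key stays put only when its hash gives the same remainder under both moduli, so growing from 4 to 5 nodes is expected to relocate roughly 80% of keys. Consistent hashing aims to bring that down to about 1/(N+1) of keys when one node is added to N.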
| Document | What You'll Learn |
| --- | --- |
| sql-vs-nosql.md | Relational vs non-relational: when to choose each |
| document-db-guide.md | MongoDB, Couchbase: flexible schema, embedding vs referencing |
| when-to-use-graph-db.md | Neo4j, Neptune: fraud detection, recommendations, knowledge graphs |
| when-to-use-vector-db.md | Pinecone, pgvector: semantic search, RAG, embeddings |
| when-to-use-timeseries-db.md | TimescaleDB, InfluxDB: IoT, metrics, downsampling |
| when-to-use-key-value-db.md | Redis, DynamoDB: caching, sessions, rate limiting |
| when-to-use-wide-column-db.md | Cassandra, ScyllaDB: massive write throughput, global distribution |
| decision-matrix.md | Decision trees, comparison tables, "choose X when Y" guides |
| Document | What You'll Learn |
| --- | --- |
| csv.md | Universal exchange, limitations, when to use |
| json.md | APIs, config, JSONL streaming, JSON Schema |
| xml.md | Enterprise integration, namespaces, SOAP |
| avro.md | Kafka serialization, schema evolution, Schema Registry |
| parquet.md | Columnar analytics, predicate pushdown, compression |
| orc.md | Hive ecosystem, stripe indexing, ACID support |
| format-comparison-matrix.md | All formats compared, decision tree, pipeline diagrams |
| schema-evolution.md | Backward/forward/full compatibility, migration patterns |
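Backward compatibility, as covered in schema-evolution.md, means a newer reader can still consume records written with an older schema. The sketch below is loosely modelled on Avro-style schema resolution and is an assumption-laden simplification: records are plain dictionaries, the reader schema is represented only by its field names and defaults, and `Resolve` is a hypothetical function rather than anything from the repository or the Avro library.

```csharp
using System;
using System.Collections.Generic;

// Project a written record onto the reader's schema:
// - fields the reader expects but the writer omitted take the reader's default
// - fields the writer included but the reader does not know are dropped
Dictionary<string, object> Resolve(
    IReadOnlyDictionary<string, object> written,
    IReadOnlyDictionary<string, object> readerDefaults)
{
    var result = new Dictionary<string, object>();
    foreach (var (field, fallback) in readerDefaults)
    {
        result[field] = written.TryGetValue(field, out var value) ? value : fallback;
    }
    return result;
}

// Record written with an old schema: no "email", plus a field the reader has since removed.
var oldRecord = new Dictionary<string, object> { ["id"] = 42, ["legacyFlag"] = true };
// Reader schema v2 (hypothetical): field name -> default value.
var readerV2 = new Dictionary<string, object> { ["id"] = 0, ["email"] = "unknown" };

var resolved = Resolve(oldRecord, readerV2);
Console.WriteLine(resolved["id"]);                      // 42
Console.WriteLine(resolved["email"]);                   // unknown
Console.WriteLine(resolved.ContainsKey("legacyFlag"));  // False
```

The design point is that adding a field is only safe for old data when the new field carries a default. Without one, the reader has nothing to substitute.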
# Run ALL simulations (replication, partitioning, file formats, db selection)
dotnet run --project src/DataIntensiveLearning.Simulations
# Run a specific simulation category
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection
# Run all tests
dotnet test
# Run tests with detailed output
dotnet test --verbosity normal
# Run a specific test category
dotnet test --filter "FullyQualifiedName~Partitioning"
dotnet test --filter "FullyQualifiedName~Replication"
dotnet test --filter "FullyQualifiedName~DatabaseSelection"
The docker-compose.yml provides a full local environment with multiple database engines for hands-on comparison.
# Start all services
docker-compose up -d
# Verify all containers are healthy
docker-compose ps
# Stop all services
docker-compose down
# Stop and remove all data volumes
docker-compose down -v
Default connection strings (see docker-compose.yml for credentials):
| Service | Connection | Purpose |
| --- | --- | --- |
| PostgreSQL | `Host=localhost;Port=5432;Database=datalearning;Username=admin;Password=admin` | SQL/relational patterns |
| Redis | `localhost:6379` | Caching, sessions, key-value patterns |
| MongoDB | `mongodb://admin:admin@localhost:27017` | Document DB patterns |
| Neo4j | `bolt://localhost:7687` (neo4j/adminpassword) | Graph DB patterns |
| TimescaleDB | `Host=localhost;Port=5433;Database=timeseries;Username=admin;Password=admin` | Time-series patterns |
Key Tradeoffs at a Glance
| Tradeoff | Option A | Option B | When to Choose A | When to Choose B |
| --- | --- | --- | --- | --- |
| Consistency vs. Availability | Strong consistency | Eventual consistency | Financial transactions, inventory | Social feeds, analytics, caching |
| Normalization vs. Denormalization | Normalized (3NF) | Denormalized | Write-heavy, data integrity critical | Read-heavy, query performance critical |
| Row vs. Columnar Storage | Row-oriented | Column-oriented | OLTP, point lookups, frequent writes | OLAP, aggregations, scan-heavy queries |
| SQL vs. NoSQL | Relational DB | Document/KV/Graph DB | Complex joins, ACID needed | Flexible schema, horizontal scale |
| Replication: Sync vs. Async | Synchronous | Asynchronous | Durability guarantees required | Low latency, high throughput |
| Partitioning: Hash vs. Range | Hash partitioning | Range partitioning | Even distribution, no range queries | Range scans, time-series data |
| Binary vs. Text Formats | Avro/Protobuf/Parquet | JSON/CSV/XML | Performance, bandwidth, large data | Human readability, debugging, interchange |
| Dedicated vs. Embedded DB | Purpose-built (Neo4j, InfluxDB) | Extension (pgvector, TimescaleDB) | Max performance, specialized features | Reduce operational complexity |
| Simulator | What It Demonstrates |
| --- | --- |
| `LeaderFollowerSimulator` | Sync/async replication, replication lag, stale reads, follower failure/recovery, leader failover |
| `MultiLeaderSimulator` | Cross-datacenter replication, write conflicts, LWW resolution, custom merge functions |
| `LeaderlessSimulator` | Quorum writes (W nodes), quorum reads (R nodes), read repair, anti-entropy, sloppy quorums |
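Last-write-wins (LWW) resolution, one of the behaviours listed for MultiLeaderSimulator, can be sketched as a pure function over two conflicting versions. Everything here is an assumption made for illustration: the `(Value, Timestamp, NodeId)` tuple shape, the `ResolveLww` name, and the node-ID tie-break are hypothetical and may not match how the repository's simulator models writes.

```csharp
using System;

// Pick the version with the larger timestamp. On a tie, fall back to an
// ordinal comparison of node IDs so that every replica chooses the same winner
// regardless of the order in which it received the two writes.
(string Value, long Timestamp, string NodeId) ResolveLww(
    (string Value, long Timestamp, string NodeId) a,
    (string Value, long Timestamp, string NodeId) b)
{
    if (a.Timestamp != b.Timestamp)
        return a.Timestamp > b.Timestamp ? a : b;
    return string.CompareOrdinal(a.NodeId, b.NodeId) >= 0 ? a : b;
}

// Two datacenters accept concurrent writes to the same key (made-up data).
var fromEu = ("alice@eu.example", 1_700_000_000_100L, "dc-eu");
var fromUs = ("alice@us.example", 1_700_000_000_250L, "dc-us");

var winner = ResolveLww(fromEu, fromUs);
Console.WriteLine(winner.Value); // alice@us.example
```

LWW converges, but it does so by silently discarding the losing write, and it depends on timestamps that may come from unsynchronized clocks. That is why merge functions and CRDTs are usually discussed as alternatives.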
Partitioning Implementations
| Partitioner | What It Demonstrates |
| --- | --- |
| `HashPartitioner` | Hash-mod routing, distribution analysis, rehashing problem, hash function comparison |
| `RangePartitioner` | Boundary-based routing, range query support, partition pruning |
| `ConsistentHashPartitioner` | Hash ring, virtual nodes, minimal redistribution on node add/remove |
| `ListPartitioner<T>` | Explicit mapping, geo-routing, tenant isolation, default partition |
| `TimeBasedPartitioner` | Time-bucketed partitioning, retention policies, partition plans |
| `CompositePartitioner` | Two-level (hash + time) partitioning, similar to a Cassandra partition key plus clustering column |
| `SkewAnalyzer` | Coefficient of variation (CV) calculation, hot partition detection, distribution health check |
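The coefficient of variation (CV) used for skew detection is the standard deviation of partition sizes divided by their mean, so 0 means perfectly even and larger values mean more skew. The sketch below is not the repository's SkewAnalyzer: the function names, the use of the population standard deviation, and the "more than 2x the mean" hot-partition threshold are all assumptions chosen for illustration.

```csharp
using System;
using System.Linq;

// CV = population standard deviation / mean. Expects at least one partition.
double CoefficientOfVariation(int[] partitionSizes)
{
    double mean = partitionSizes.Average();
    if (mean == 0) return 0;
    double variance = partitionSizes.Sum(s => (s - mean) * (s - mean)) / partitionSizes.Length;
    return Math.Sqrt(variance) / mean;
}

// Indexes of partitions holding more than `factor` times the mean size
// (the threshold is a hypothetical default, not a repository setting).
int[] HotPartitions(int[] partitionSizes, double factor = 2.0)
{
    double mean = partitionSizes.Average();
    return Enumerable.Range(0, partitionSizes.Length)
        .Where(i => partitionSizes[i] > factor * mean)
        .ToArray();
}

int[] balanced = { 100, 100, 100, 100 };
int[] skewed = { 10, 10, 10, 370 };

Console.WriteLine(CoefficientOfVariation(balanced));        // 0
Console.WriteLine(CoefficientOfVariation(skewed));          // well above 1 for this input
Console.WriteLine(string.Join(",", HotPartitions(skewed))); // 3
```

Because CV is normalized by the mean, the same threshold can be applied to clusters of very different sizes.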
Database Selection Advisor
The DatabaseSelectionAdvisor evaluates workload profiles (read/write ratio, query complexity, consistency needs, scale, latency, data model) and recommends the best database type with reasoning, alternatives, and warnings. Run the predefined scenarios to see recommendations for e-commerce, IoT, fraud detection, RAG, caching, and more.
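A rule-based recommendation of this kind can be pictured as an ordered list of pattern checks over a workload description. The sketch below is purely illustrative: the `Recommend` function, its three parameters, the data-model labels, and the rule order are hypothetical and far simpler than a real advisor would need. The repository's DatabaseSelectionAdvisor may use different inputs and rules.

```csharp
using System;

// Rules are evaluated top to bottom; the first match wins.
string Recommend(string dataModel, bool needsAcid, bool writeHeavy)
{
    return dataModel switch
    {
        "graph" => "Graph",
        "embedding" => "Vector",
        "time-series" => "Time-Series",
        "key-value" => "Key-Value",
        "document" when !needsAcid => "Document",
        _ when needsAcid => "SQL",
        _ when writeHeavy => "Wide-Column",
        _ => "SQL",
    };
}

Console.WriteLine(Recommend("graph", needsAcid: false, writeHeavy: false));     // Graph
Console.WriteLine(Recommend("document", needsAcid: true, writeHeavy: false));   // SQL
Console.WriteLine(Recommend("relational", needsAcid: false, writeHeavy: true)); // Wide-Column
```

Keeping the rules in one ordered expression makes precedence explicit, which matters once several rules could match the same workload.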
| Example | What It Demonstrates |
| --- | --- |
| CSV/JSON/XML | Serialization, deserialization, format comparison |
| Avro | Schema evolution, compatibility modes, Schema Registry concepts |
| Parquet | Row vs columnar storage, predicate pushdown, column pruning, compression |
| ORC | Stripe indexing, ACID support, ORC vs Parquet comparison |
| Protobuf | Binary encoding (varint, tags), schema evolution, size comparison |
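The varint encoding mentioned for Protobuf stores an unsigned integer in 7-bit groups, least-significant group first, with the high bit of each byte signalling that more bytes follow. The functions below are a minimal standalone sketch of that scheme. `EncodeVarint` and `DecodeVarint` are hypothetical names and are not taken from the repository or from the Protobuf library.

```csharp
using System;
using System.Collections.Generic;

// Encode an unsigned integer as a base-128 varint.
byte[] EncodeVarint(ulong value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        bytes.Add((byte)((value & 0x7F) | 0x80)); // low 7 bits + continuation flag
        value >>= 7;
    }
    bytes.Add((byte)value); // final byte, continuation flag clear
    return bytes.ToArray();
}

// Decode a single varint from the start of the buffer.
ulong DecodeVarint(byte[] bytes)
{
    ulong result = 0;
    int shift = 0;
    foreach (byte b in bytes)
    {
        result |= (ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break; // last byte of this varint
        shift += 7;
    }
    return result;
}

Console.WriteLine(BitConverter.ToString(EncodeVarint(300))); // AC-02
Console.WriteLine(DecodeVarint(EncodeVarint(300)));          // 300
```

Small values fit in a single byte, which is a large part of why Protobuf payloads tend to be smaller than the equivalent JSON for integer-heavy data.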
This repository is for educational purposes. See LICENSE for details.