senapatisantosh/DistributedSystemWithBigData

Data-Intensive Applications Learning Repository

A hands-on .NET 10 learning repository for understanding the core principles behind data-intensive applications. Inspired by Martin Kleppmann's Designing Data-Intensive Applications, this project provides runnable simulations, comparison benchmarks, and reference documentation covering replication, partitioning, database selection, file formats, and system design tradeoffs.


What You Will Learn

| Topic | Description |
| --- | --- |
| Replication | Leader-follower, multi-leader, and leaderless replication strategies; conflict resolution; consistency guarantees; failover |
| Partitioning & Sharding | Hash, range, consistent hashing, list, time-based, composite partitioning; rebalancing; hot spot mitigation |
| Database Selection | SQL vs Document vs Graph vs Vector vs Time-Series vs Key-Value vs Wide-Column; decision matrices; real-world scenarios |
| File Formats | CSV, JSON, XML, Avro, Parquet, ORC, Protobuf; row vs columnar storage; schema evolution |
| System Design Tradeoffs | Consistency vs availability, normalization vs denormalization, latency vs throughput; comprehensive decision guides |

Repository Map

graph TD
    ROOT[DataIntensiveLearning]

    ROOT --> DOCS[docs/]
    ROOT --> SRC[src/]
    ROOT --> TESTS[tests/]
    ROOT --> DOCKER[docker/]

    DOCS --> D1[replication/]
    DOCS --> D2[partitioning/]
    DOCS --> D3[database-selection/]
    DOCS --> D4[file-formats/]
    DOCS --> D5[system-design-scenarios/]
    DOCS --> D6[decision-guides/]

    SRC --> S1[Core — Shared models & interfaces]
    SRC --> S2[Simulations — Runnable demos]
    SRC --> S3[Partitioning — Strategy implementations]
    SRC --> S4[FileFormats — Serialization examples]
    SRC --> S5[DatabaseSelection — Advisor engine]
    SRC --> S6[Api — ASP.NET Core scaffold]

    TESTS --> T1[UnitTests — xUnit + FluentAssertions]

    style ROOT fill:#2d3436,stroke:#dfe6e9,color:#dfe6e9
    style DOCS fill:#0984e3,stroke:#dfe6e9,color:#fff
    style SRC fill:#00b894,stroke:#dfe6e9,color:#fff
    style TESTS fill:#fdcb6e,stroke:#2d3436,color:#2d3436
    style DOCKER fill:#e17055,stroke:#dfe6e9,color:#fff

Repository Structure

DistributedSystemWithBigData/
├── docs/
│   ├── replication/
│   │   ├── replication-overview.md         # All replication strategies compared
│   │   ├── leader-follower.md              # Single-leader deep dive
│   │   ├── multi-leader.md                 # Multi-leader + conflict resolution
│   │   ├── leaderless.md                   # Dynamo-style quorum replication
│   │   ├── consistency-and-lag.md          # Replication lag + consistency models
│   │   └── failure-scenarios.md            # Failover, split-brain, data loss
│   ├── partitioning/
│   │   ├── partitioning-overview.md        # Partitioning vs sharding taxonomy
│   │   ├── hash-partitioning.md            # Hash-based key distribution
│   │   ├── range-partitioning.md           # Range-based for time-series + scans
│   │   ├── consistent-hashing.md           # Hash ring + vnodes
│   │   ├── tenant-sharding.md              # Multi-tenant isolation strategies
│   │   ├── hot-partitions-and-rebalancing.md # Skew detection + rebalancing
│   │   └── partition-key-selection.md       # How to choose partition keys
│   ├── database-selection/
│   │   ├── sql-vs-nosql.md                 # Relational vs non-relational
│   │   ├── document-db-guide.md            # MongoDB, Couchbase deep dive
│   │   ├── when-to-use-graph-db.md         # Neo4j, Neptune use cases
│   │   ├── when-to-use-vector-db.md        # Pinecone, pgvector, RAG
│   │   ├── when-to-use-timeseries-db.md    # TimescaleDB, InfluxDB
│   │   ├── when-to-use-key-value-db.md     # Redis, DynamoDB patterns
│   │   ├── when-to-use-wide-column-db.md   # Cassandra, ScyllaDB
│   │   └── decision-matrix.md              # Comprehensive comparison + flowcharts
│   ├── file-formats/
│   │   ├── csv.md                          # CSV deep dive
│   │   ├── json.md                         # JSON + JSON Lines
│   │   ├── xml.md                          # XML + XSD + SOAP
│   │   ├── avro.md                         # Avro + Schema Registry
│   │   ├── parquet.md                      # Columnar analytics format
│   │   ├── orc.md                          # Hive-native columnar format
│   │   ├── format-comparison-matrix.md     # All formats compared
│   │   └── schema-evolution.md             # Schema evolution patterns
│   ├── system-design-scenarios/
│   │   ├── ecommerce-platform.md           # Orders, catalog, search, cache
│   │   ├── iot-platform.md                 # Sensor ingestion + time-series
│   │   ├── fraud-detection.md              # Graph-based fraud ring detection
│   │   ├── observability-pipeline.md       # Metrics, logs, traces
│   │   └── rag-search-platform.md          # Vector DB + RAG architecture
│   ├── interview-revision-cheatsheet.md    # Master cheat sheet for all topics
│   └── architecture-decision-flow.md       # Decision flowcharts (Mermaid)
├── src/
│   ├── DataIntensiveLearning.Core/         # Shared domain models + abstractions
│   │   ├── Enums/                          # DatabaseType, ReplicationType, etc.
│   │   ├── Interfaces/                     # IPartitioner, IReplicationSimulator
│   │   └── Models/                         # DataRecord, ReplicaNode, Partition
│   ├── DataIntensiveLearning.Simulations/  # Runnable console simulations
│   │   ├── Program.cs                      # Entry point (run with --scenario)
│   │   └── Replication/
│   │       ├── LeaderFollowerSimulator.cs   # Sync/async replication + failover
│   │       ├── MultiLeaderSimulator.cs      # Conflict detection + resolution
│   │       └── LeaderlessSimulator.cs       # Quorum reads/writes + read repair
│   ├── DataIntensiveLearning.Partitioning/ # Partitioning strategy implementations
│   │   ├── HashPartitioner.cs              # Hash-mod with distribution analysis
│   │   ├── RangePartitioner.cs             # Boundary-based range partitioning
│   │   ├── ConsistentHashPartitioner.cs    # Hash ring with virtual nodes
│   │   ├── ListPartitioner.cs              # Explicit key-to-partition mapping
│   │   ├── TimeBasedPartitioner.cs         # Time-bucketed partitioning
│   │   ├── CompositePartitioner.cs         # Hash + time two-level partitioning
│   │   └── SkewAnalyzer.cs                 # Hot partition detection
│   ├── DataIntensiveLearning.FileFormats/  # File format examples
│   │   ├── CsvFormatExample.cs             # CSV read/write
│   │   ├── JsonFormatExample.cs            # JSON serialization
│   │   ├── XmlFormatExample.cs             # XML serialization + namespaces
│   │   ├── AvroFormatExample.cs            # Avro schema + evolution concepts
│   │   ├── ParquetFormatExample.cs         # Columnar storage concepts
│   │   ├── OrcFormatExample.cs             # ORC vs Parquet comparison
│   │   ├── ProtobufFormatExample.cs        # Binary encoding + schema evolution
│   │   └── Formats/                        # Reusable format implementations
│   ├── DataIntensiveLearning.DatabaseSelection/
│   │   ├── DatabaseSelectionAdvisor.cs     # Rule-based recommendation engine
│   │   └── Models/                         # WorkloadProfile, DatabaseRecommendation
│   └── DataIntensiveLearning.Api/          # ASP.NET Core API scaffold
├── tests/
│   └── DataIntensiveLearning.UnitTests/
│       ├── Partitioning/                   # Hash, range, consistent hash, list, time, composite tests
│       ├── Replication/                    # Quorum, leaderless simulator tests
│       ├── FileFormats/                    # CSV, JSON serialization tests
│       └── DatabaseSelection/              # Advisor recommendation tests
├── docker/
│   └── Dockerfile.api                      # .NET 10 API container
├── docker-compose.yml                      # PostgreSQL, Redis, MongoDB, Neo4j, TimescaleDB
├── DataIntensiveLearning.sln               # .NET solution file
└── README.md                               # This file

Study Paths

Path 1: Beginner (Foundations First)

graph LR
    A[1. Replication] --> B[2. Partitioning]
    B --> C[3. Database Selection]
    C --> D[4. File Formats]
    D --> E[5. System Design Scenarios]

    style A fill:#0984e3,color:#fff
    style B fill:#00b894,color:#fff
    style C fill:#fdcb6e,color:#2d3436
    style D fill:#e17055,color:#fff
    style E fill:#6c5ce7,color:#fff
  1. Replication — How data is copied across nodes and why consistency is hard.
  2. Partitioning — How to split data across machines for horizontal scale.
  3. Database Selection — Choosing the right database for each workload.
  4. File Formats — How serialization and storage format choices affect performance.
  5. System Design — Apply everything in realistic design exercises.

Path 2: Interview Prep (Fast Track)

  1. Interview Cheat Sheet — One-page summaries and common questions.
  2. Architecture Decision Flow — Decision flowcharts for databases, formats, partitioning.
  3. Database Decision Matrix — "Choose X when Y" tables.
  4. System Design Scenarios — Practice articulating design decisions.

Path 3: Deep Dive by Topic

| Start Here | Then Read | Then Code |
| --- | --- | --- |
| Replication Overview | Leader-Follower, Multi-Leader, Leaderless | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication` |
| Partitioning Overview | Hash, Range, Consistent Hashing | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning` |
| SQL vs NoSQL | Individual DB guides → Decision Matrix | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection` |
| Format Comparison | Individual format docs → Schema Evolution | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats` |

Topic-by-Topic Navigation

Replication

| Document | What You'll Learn |
| --- | --- |
| replication-overview.md | All three topologies compared, sync vs async, when to use each |
| leader-follower.md | Single-leader write flow, read replicas, failover process |
| multi-leader.md | Multi-datacenter replication, conflict resolution (LWW, merge, CRDTs) |
| leaderless.md | Quorum reads/writes (W+R>N), read repair, anti-entropy, sloppy quorums |
| consistency-and-lag.md | Read-after-write, monotonic reads, bounded staleness, causal consistency |
| failure-scenarios.md | Leader failure, split-brain, network partitions, data loss scenarios |
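
The quorum condition W+R>N listed for leaderless.md can be checked with a few lines of C#. This is a minimal sketch, not code from the repository: `QuorumsOverlap` and `MinOverlap` are hypothetical helper names, and N, W, and R are the replica count, write quorum, and read quorum.

```csharp
using System;

// Hypothetical helper (not from the repository): a read quorum and a write
// quorum are guaranteed to share at least one replica when W + R > N.
bool QuorumsOverlap(int n, int w, int r) => w + r > n;

// Smallest number of replicas that must appear in both quorums (0 if none).
int MinOverlap(int n, int w, int r) => Math.Max(0, w + r - n);

Console.WriteLine(QuorumsOverlap(3, 2, 2)); // True: every read sees at least one up-to-date replica
Console.WriteLine(QuorumsOverlap(3, 1, 1)); // False: a read may miss the latest write
Console.WriteLine(MinOverlap(5, 3, 3));     // 1
```

With N=3, the common choice W=2, R=2 tolerates one unavailable replica for both reads and writes while keeping the overlap guarantee.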

Partitioning & Sharding

| Document | What You'll Learn |
| --- | --- |
| partitioning-overview.md | Partitioning vs sharding, horizontal vs vertical, OLTP vs OLAP |
| hash-partitioning.md | Hash functions, distribution analysis, the rehashing problem |
| range-partitioning.md | Boundary selection, time-series advantages, hot spot risks |
| consistent-hashing.md | Hash ring, virtual nodes, minimal redistribution on scaling |
| tenant-sharding.md | Multi-tenant isolation levels, geo-sharding, GDPR compliance |
| hot-partitions-and-rebalancing.md | Skew detection, key salting, rebalancing strategies |
| partition-key-selection.md | How to choose keys, cardinality, composite keys, common mistakes |
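
The rehashing problem mentioned for hash-partitioning.md is easy to see with numbers. The sketch below is illustrative only and is not the repository's `HashPartitioner`: `PartitionFor` and `CountMoved` are hypothetical names, and integer keys stand in for already-hashed values so the result is deterministic.

```csharp
using System;
using System.Linq;

// Hash-mod routing. Integer keys stand in for already-hashed values
// (an assumption that keeps the output deterministic).
int PartitionFor(int hashedKey, int partitionCount) => hashedKey % partitionCount;

// Count the keys whose partition changes when the partition count changes.
int CountMoved(int keyCount, int before, int after) =>
    Enumerable.Range(0, keyCount)
              .Count(k => PartitionFor(k, before) != PartitionFor(k, after));

int moved = CountMoved(1000, 4, 5);
Console.WriteLine($"{moved} of 1000 keys moved"); // 800 of 1000 keys moved
```

Growing from 4 to 5 partitions relocates 80% of the keys here, while only about 1/5 of them need to move to fill the new partition. Consistent hashing aims to get close to that minimum.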

Database Selection

| Document | What You'll Learn |
| --- | --- |
| sql-vs-nosql.md | Relational vs non-relational: when to choose each |
| document-db-guide.md | MongoDB, Couchbase: flexible schema, embedding vs referencing |
| when-to-use-graph-db.md | Neo4j, Neptune: fraud detection, recommendations, knowledge graphs |
| when-to-use-vector-db.md | Pinecone, pgvector: semantic search, RAG, embeddings |
| when-to-use-timeseries-db.md | TimescaleDB, InfluxDB: IoT, metrics, downsampling |
| when-to-use-key-value-db.md | Redis, DynamoDB: caching, sessions, rate limiting |
| when-to-use-wide-column-db.md | Cassandra, ScyllaDB: massive write throughput, global distribution |
| decision-matrix.md | Decision trees, comparison tables, "choose X when Y" guides |

File Formats

| Document | What You'll Learn |
| --- | --- |
| csv.md | Universal exchange, limitations, when to use |
| json.md | APIs, config, JSONL streaming, JSON Schema |
| xml.md | Enterprise integration, namespaces, SOAP |
| avro.md | Kafka serialization, schema evolution, Schema Registry |
| parquet.md | Columnar analytics, predicate pushdown, compression |
| orc.md | Hive ecosystem, stripe indexing, ACID support |
| format-comparison-matrix.md | All formats compared, decision tree, pipeline diagrams |
| schema-evolution.md | Backward/forward/full compatibility, migration patterns |

System Design Scenarios

| Document | What You'll Learn |
| --- | --- |
| ecommerce-platform.md | Orders (SQL) + catalog (MongoDB) + cache (Redis) + analytics |
| iot-platform.md | Sensor ingestion → Kafka → TimescaleDB, time partitioning |
| fraud-detection.md | Graph traversal for fraud rings, Neo4j + real-time streaming |
| observability-pipeline.md | Metrics + logs + traces, Prometheus, Parquet archival |
| rag-search-platform.md | Embeddings + vector DB + hybrid search, pgvector vs Pinecone |

Getting Started

Prerequisites

Running the Simulations

# Run ALL simulations (replication, partitioning, file formats, db selection)
dotnet run --project src/DataIntensiveLearning.Simulations

# Run a specific simulation category
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection

Running Tests

# Run all tests
dotnet test

# Run tests with detailed output
dotnet test --verbosity normal

# Run a specific test category
dotnet test --filter "FullyQualifiedName~Partitioning"
dotnet test --filter "FullyQualifiedName~Replication"
dotnet test --filter "FullyQualifiedName~DatabaseSelection"

Docker Compose Setup

The docker-compose.yml provides a full local environment with multiple database engines for hands-on comparison.

# Start all services
docker-compose up -d

# Verify all containers are healthy
docker-compose ps

# Stop all services
docker-compose down

# Stop and remove all data volumes
docker-compose down -v

Default connection strings (see docker-compose.yml for credentials):

| Service | Connection | Purpose |
| --- | --- | --- |
| PostgreSQL | `Host=localhost;Port=5432;Database=datalearning;Username=admin;Password=admin` | SQL/relational patterns |
| Redis | `localhost:6379` | Caching, sessions, key-value patterns |
| MongoDB | `mongodb://admin:admin@localhost:27017` | Document DB patterns |
| Neo4j | `bolt://localhost:7687` (neo4j/adminpassword) | Graph DB patterns |
| TimescaleDB | `Host=localhost;Port=5433;Database=timeseries;Username=admin;Password=admin` | Time-series patterns |

Key Tradeoffs at a Glance

| Tradeoff | Option A | Option B | When to Choose A | When to Choose B |
| --- | --- | --- | --- | --- |
| Consistency vs. Availability | Strong consistency | Eventual consistency | Financial transactions, inventory | Social feeds, analytics, caching |
| Normalization vs. Denormalization | Normalized (3NF) | Denormalized | Write-heavy, data integrity critical | Read-heavy, query performance critical |
| Row vs. Columnar Storage | Row-oriented | Column-oriented | OLTP, point lookups, frequent writes | OLAP, aggregations, scan-heavy queries |
| SQL vs. NoSQL | Relational DB | Document/KV/Graph DB | Complex joins, ACID needed | Flexible schema, horizontal scale |
| Replication: Sync vs. Async | Synchronous | Asynchronous | Durability guarantees required | Low latency, high throughput |
| Partitioning: Hash vs. Range | Hash partitioning | Range partitioning | Even distribution, no range queries | Range scans, time-series data |
| Binary vs. Text Formats | Avro/Protobuf/Parquet | JSON/CSV/XML | Performance, bandwidth, large data | Human readability, debugging, interchange |
| Dedicated vs. Embedded DB | Purpose-built (Neo4j, InfluxDB) | Extension (pgvector, TimescaleDB) | Max performance, specialized features | Reduce operational complexity |

What's in the Code

Replication Simulations

| Simulator | What It Demonstrates |
| --- | --- |
| LeaderFollowerSimulator | Sync/async replication, replication lag, stale reads, follower failure/recovery, leader failover |
| MultiLeaderSimulator | Cross-datacenter replication, write conflicts, LWW resolution, custom merge functions |
| LeaderlessSimulator | Quorum writes (W nodes), quorum reads (R nodes), read repair, anti-entropy, sloppy quorums |
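
Last-write-wins (LWW) resolution, listed for the MultiLeaderSimulator, can be sketched as follows. This is an assumption-laden illustration rather than the simulator's actual code: the tuple shape `(Value, Timestamp, NodeId)`, the function name `ResolveLww`, and the node-id tie-break are all hypothetical.

```csharp
using System;

// Hypothetical version shape: (value, timestamp in ms, id of the writing node).
(string Value, long Timestamp, string NodeId) ResolveLww(
    (string Value, long Timestamp, string NodeId) a,
    (string Value, long Timestamp, string NodeId) b)
{
    // The version with the later timestamp wins; the other write is discarded.
    if (a.Timestamp != b.Timestamp)
        return a.Timestamp > b.Timestamp ? a : b;

    // Tie-break on node id so every replica picks the same winner.
    return string.CompareOrdinal(a.NodeId, b.NodeId) >= 0 ? a : b;
}

var fromEu = ("alice@new.example", 1_700_000_000_200L, "dc-eu");
var fromUs = ("alice@old.example", 1_700_000_000_100L, "dc-us");

var winner = ResolveLww(fromEu, fromUs);
Console.WriteLine(winner.Value); // alice@new.example
```

LWW converges because all replicas apply the same deterministic rule, but it silently drops the losing write and depends on reasonably synchronized clocks, which is why custom merge functions and CRDTs are listed as alternatives.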

Partitioning Implementations

| Partitioner | What It Demonstrates |
| --- | --- |
| HashPartitioner | Hash-mod routing, distribution analysis, rehashing problem, hash function comparison |
| RangePartitioner | Boundary-based routing, range query support, partition pruning |
| ConsistentHashPartitioner | Hash ring, virtual nodes, minimal redistribution on node add/remove |
| `ListPartitioner<T>` | Explicit mapping, geo-routing, tenant isolation, default partition |
| TimeBasedPartitioner | Time-bucketed partitioning, retention policies, partition plans |
| CompositePartitioner | Two-level (hash + time) like Cassandra partition key + clustering column |
| SkewAnalyzer | CV calculation, hot partition detection, distribution health check |
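
The CV (coefficient of variation) calculation listed for SkewAnalyzer is the standard deviation of partition sizes divided by their mean. The sketch below shows the idea under stated assumptions: it uses the population standard deviation and a hypothetical "hot" threshold of 1.5 times the mean, and the function names are illustrative. The repository's analyzer may use a different formula or thresholds.

```csharp
using System;
using System.Globalization;
using System.Linq;

// CV = population standard deviation / mean. 0 means perfectly even.
// Expects a non-empty array of per-partition record counts.
double CoefficientOfVariation(int[] counts)
{
    double mean = counts.Average();
    if (mean == 0) return 0;
    double variance = counts.Sum(c => (c - mean) * (c - mean)) / counts.Length;
    return Math.Sqrt(variance) / mean;
}

// Indexes of partitions holding more than factor * mean records.
// The factor is a hypothetical tuning knob.
int[] HotPartitions(int[] counts, double factor)
{
    double mean = counts.Average();
    return Enumerable.Range(0, counts.Length)
                     .Where(i => counts[i] > factor * mean)
                     .ToArray();
}

int[] balanced = { 100, 100, 100, 100 };
int[] skewed = { 250, 50, 50, 50 };

Console.WriteLine(CoefficientOfVariation(balanced).ToString("F2", CultureInfo.InvariantCulture)); // 0.00
Console.WriteLine(CoefficientOfVariation(skewed).ToString("F2", CultureInfo.InvariantCulture));   // 0.87
Console.WriteLine(string.Join(",", HotPartitions(skewed, 1.5)));                                   // 0
```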

Database Selection Advisor

The DatabaseSelectionAdvisor evaluates workload profiles (read/write ratio, query complexity, consistency needs, scale, latency, data model) and recommends the best database type with reasoning, alternatives, and warnings. Run the predefined scenarios to see recommendations for e-commerce, IoT, fraud detection, RAG, caching, and more.

File Format Examples

| Example | What It Demonstrates |
| --- | --- |
| CSV/JSON/XML | Serialization, deserialization, format comparison |
| Avro | Schema evolution, compatibility modes, Schema Registry concepts |
| Parquet | Row vs columnar storage, predicate pushdown, column pruning, compression |
| ORC | Stripe indexing, ACID support, ORC vs Parquet comparison |
| Protobuf | Binary encoding (varint, tags), schema evolution, size comparison |
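
The varint encoding mentioned in the Protobuf row stores an integer in 7-bit groups, least significant group first, with the high bit of each byte signalling that more bytes follow. The following is a minimal standalone sketch of that scheme. `EncodeVarint` and `DecodeVarint` are hypothetical names and are not taken from `ProtobufFormatExample.cs` or from a Protobuf library.

```csharp
using System;
using System.Collections.Generic;

// Encode an unsigned integer as a base-128 varint.
byte[] EncodeVarint(ulong value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        // Low 7 bits plus a continuation flag in the high bit.
        bytes.Add((byte)((value & 0x7F) | 0x80));
        value >>= 7;
    }
    bytes.Add((byte)value); // final byte: continuation flag clear
    return bytes.ToArray();
}

// Decode a single varint from the start of the buffer.
ulong DecodeVarint(byte[] buffer)
{
    ulong result = 0;
    int shift = 0;
    foreach (byte b in buffer)
    {
        result |= (ulong)(b & 0x7F) << shift;
        shift += 7;
        if ((b & 0x80) == 0) break; // last byte of this varint
    }
    return result;
}

Console.WriteLine(BitConverter.ToString(EncodeVarint(300))); // AC-02
Console.WriteLine(DecodeVarint(new byte[] { 0xAC, 0x02 }));  // 300
Console.WriteLine(EncodeVarint(1).Length);                    // 1
```

Small values take a single byte, which is one reason Protobuf payloads are often smaller than the equivalent JSON text.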

License

This repository is for educational purposes. See LICENSE for details.
