# Document Stores: MongoDB & CouchDB Patterns

Document stores are a category of NoSQL databases that store data as semi-structured documents (JSON, BSON, XML). They offer **schema flexibility**, **horizontal scalability**, and **developer-friendly data models**.

## Key Concepts

| Concept | Description |
|---------|-------------|
| **Document** | Self-contained data unit (JSON-like), analogous to a row in RDBMS |
| **Collection** | Group of documents, analogous to a table |
| **Database** | Container for collections |
| **BSON** | Binary JSON - MongoDB's storage format with extended types |
| **Schema Flexibility** | Documents in the same collection can have different fields |

---
## 1. The Document Model

### What is a Document?

A document is a **self-describing, hierarchical data structure** that maps closely to objects in programming languages.

```json
{
  "_id": "ObjectId('507f1f77bcf86cd799439011')",
  "name": "John Doe",
  "email": "john@example.com",
  "orders": [
    { "product": "Laptop", "price": 1200 },
    { "product": "Mouse", "price": 25 }
  ],
  "address": {
    "city": "New York",
    "zip": "10001"
  }
}
```

### BSON (Binary JSON)

MongoDB uses **BSON** internally, which extends JSON with:
- **Additional data types**: `ObjectId`, `Date`, `Binary`, `Decimal128`, `Int32`, `Int64`
- **Efficient encoding**: Faster parsing and traversal
- **Field ordering**: Preserves insertion order
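
To make the "additional data types" point concrete, here is a minimal standard-library sketch. It shows that plain JSON cannot encode a date or a decimal, and that tagging those values keeps the type information. The tag names are loosely modeled on MongoDB Extended JSON, and `to_extended_json` is a hypothetical helper, not part of any driver.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def to_extended_json(value):
    """Hypothetical fallback encoder: tag types that plain JSON cannot represent."""
    if isinstance(value, datetime):
        return {"$date": value.isoformat()}
    if isinstance(value, Decimal):
        return {"$numberDecimal": str(value)}
    raise TypeError(f"Cannot serialize {type(value).__name__}")


doc = {
    "created_at": datetime(2026, 2, 1, 15, 0, tzinfo=timezone.utc),
    "price": Decimal("1200.50"),
}

# Plain JSON has no date or decimal type, so the standard encoder refuses.
try:
    json.dumps(doc)
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

# With type tags, the values survive as self-describing sub-documents.
encoded = json.dumps(doc, default=to_extended_json)
decoded = json.loads(encoded)

print(f"Plain JSON can encode the document: {plain_json_ok}")
print(encoded)
```

BSON avoids this workaround by carrying a type marker for each value in the binary encoding itself.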

### Schema Flexibility

Unlike RDBMS, document stores allow:
- **Polymorphic documents** - different structures in the same collection
- **Evolution without migrations** - add/remove fields anytime
- **Optional validation** - enforce schema when needed (JSON Schema)
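
Optional validation can be sketched in a few lines of plain Python. This is only a stand-in for the idea: MongoDB expresses validators with a `$jsonSchema` document, whereas the `validate_document` function and the schema layout below are assumptions made for illustration.

```python
def validate_document(doc: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (an empty list means the document passes)."""
    errors = []
    for field in schema.get("required", []):
        if field not in doc:
            errors.append(f"missing required field: {field}")
    for field, expected_type in schema.get("properties", {}).items():
        # Fields not listed in "properties" are allowed: the schema stays open.
        if field in doc and not isinstance(doc[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    return errors


# Hypothetical schema: two required fields, one optional typed field.
user_schema = {
    "required": ["name", "email"],
    "properties": {"name": str, "email": str, "phone": str},
}

valid_user = {"name": "Bob Smith", "email": "bob@example.com", "phone": "+1-555-0123"}
invalid_user = {"name": "Carol", "phone": 5550123}

valid_errors = validate_document(valid_user, user_schema)
invalid_errors = validate_document(invalid_user, user_schema)

print(valid_errors)
print(invalid_errors)
```

Because unknown fields pass through untouched, documents can still gain new fields over time while the fields that matter stay checked.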

In [None]:
import json
from datetime import datetime
from typing import Any
from uuid import uuid4
from copy import deepcopy

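# Simulated ObjectId generator (a real ObjectId is 12 bytes, shown as 24 hex characters)
def generate_object_id() -> str:
    # uuid4().hex has no hyphens, so the slice yields 24 hex characters
    return uuid4().hex[:24]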

# Example documents with schema flexibility
documents = [
    {
        "_id": generate_object_id(),
        "type": "user",
        "name": "Alice Johnson",
        "email": "alice@example.com",
        "created_at": datetime.now().isoformat()
    },
    {
        "_id": generate_object_id(),
        "type": "user",
        "name": "Bob Smith",
        "email": "bob@example.com",
        "phone": "+1-555-0123",  # Additional field - schema flexibility!
        "preferences": {"theme": "dark", "notifications": True},
        "created_at": datetime.now().isoformat()
    }
]

print("Schema Flexibility Demo:")
for doc in documents:
    print(f"  Document fields: {list(doc.keys())}")

---
## 2. Simulated Document Store

Let's build a simple in-memory document store to understand core operations.

In [None]:
class DocumentStore:
    """Simulated document store mimicking MongoDB operations."""
    
    def __init__(self):
        self.collections: dict[str, list[dict]] = {}
    
    def get_collection(self, name: str) -> list:
        """Get or create a collection."""
        if name not in self.collections:
            self.collections[name] = []
        return self.collections[name]
    
    # --- CRUD Operations ---
    
    def insert_one(self, collection: str, document: dict) -> str:
        """Insert a single document."""
        doc = deepcopy(document)
        if "_id" not in doc:
            doc["_id"] = generate_object_id()
        self.get_collection(collection).append(doc)
        return doc["_id"]
    
    def insert_many(self, collection: str, documents: list[dict]) -> list[str]:
        """Insert multiple documents."""
        return [self.insert_one(collection, doc) for doc in documents]
    
    def find(self, collection: str, query: dict | None = None) -> list[dict]:
        """Find documents matching query."""
        query = query or {}
        results = []
        for doc in self.get_collection(collection):
            if self._matches(doc, query):
                results.append(deepcopy(doc))
        return results
    
    def find_one(self, collection: str, query: dict) -> dict | None:
        """Find a single document."""
        results = self.find(collection, query)
        return results[0] if results else None
    
    def update_one(self, collection: str, query: dict, update: dict) -> int:
        """Update first matching document. Returns count of modified docs."""
        for doc in self.get_collection(collection):
            if self._matches(doc, query):
                self._apply_update(doc, update)
                return 1
        return 0
    
    def delete_one(self, collection: str, query: dict) -> int:
        """Delete first matching document."""
        coll = self.get_collection(collection)
        for i, doc in enumerate(coll):
            if self._matches(doc, query):
                coll.pop(i)
                return 1
        return 0
    
    # --- Helper Methods ---
    
    def _matches(self, doc: dict, query: dict) -> bool:
        """Check if document matches query (simplified)."""
        for key, value in query.items():
            # Handle nested keys with dot notation
            keys = key.split(".")
            current = doc
            for k in keys:
                if isinstance(current, dict) and k in current:
                    current = current[k]
                else:
                    return False
            if current != value:
                return False
        return True
    
    def _apply_update(self, doc: dict, update: dict) -> None:
        """Apply update operators (simplified)."""
        if "$set" in update:
            for key, value in update["$set"].items():
                doc[key] = value
        if "$unset" in update:
            for key in update["$unset"]:
                doc.pop(key, None)
        if "$inc" in update:
            for key, value in update["$inc"].items():
                doc[key] = doc.get(key, 0) + value
        if "$push" in update:
            for key, value in update["$push"].items():
                if key not in doc:
                    doc[key] = []
                doc[key].append(value)

# Initialize store
db = DocumentStore()
print("DocumentStore initialized!")

In [None]:
# Demo CRUD operations

# Insert documents
users = [
    {"name": "Alice", "age": 30, "department": "Engineering", "skills": ["Python", "MongoDB"]},
    {"name": "Bob", "age": 25, "department": "Engineering", "skills": ["JavaScript", "React"]},
    {"name": "Charlie", "age": 35, "department": "Sales", "skills": ["CRM", "Analytics"]},
    {"name": "Diana", "age": 28, "department": "Engineering", "skills": ["Python", "Docker"]},
]

ids = db.insert_many("users", users)
print(f"Inserted {len(ids)} documents")

# Find operations
print("\n--- Find all Engineering users ---")
engineers = db.find("users", {"department": "Engineering"})
for eng in engineers:
    print(f"  {eng['name']}: {eng['skills']}")

# Update operations
print("\n--- Update Alice's age ---")
db.update_one("users", {"name": "Alice"}, {"$inc": {"age": 1}})
alice = db.find_one("users", {"name": "Alice"})
print(f"  Alice's new age: {alice['age']}")

# Add new skill using $push
print("\n--- Add skill to Bob ---")
db.update_one("users", {"name": "Bob"}, {"$push": {"skills": "Node.js"}})
bob = db.find_one("users", {"name": "Bob"})
print(f"  Bob's skills: {bob['skills']}")
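
The store above only supports equality matching. As a rough sketch of how comparison operators could be layered on top, the standalone matcher below handles a small subset (`$gt`, `$gte`, `$lt`, `$lte`, `$in`) plus dot notation. The names `matches`, `get_path` and `COMPARISONS` are hypothetical, and the semantics are simplified compared with MongoDB (for example, array fields are not traversed).

```python
from typing import Any

_MISSING = object()  # sentinel so that a stored None is not confused with "absent"

COMPARISONS = {
    "$gt": lambda actual, operand: actual > operand,
    "$gte": lambda actual, operand: actual >= operand,
    "$lt": lambda actual, operand: actual < operand,
    "$lte": lambda actual, operand: actual <= operand,
    "$in": lambda actual, operand: actual in operand,
}


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dotted path such as 'address.city'; return _MISSING if absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches(doc: dict, query: dict) -> bool:
    """True if every condition in the query holds for the document."""
    for path, condition in query.items():
        actual = get_path(doc, path)
        if actual is _MISSING:
            return False
        is_operator_doc = (
            isinstance(condition, dict)
            and bool(condition)
            and all(key.startswith("$") for key in condition)
        )
        if is_operator_doc:
            for op, operand in condition.items():
                if op not in COMPARISONS:
                    raise ValueError(f"Unsupported operator: {op}")
                if not COMPARISONS[op](actual, operand):
                    return False
        elif actual != condition:
            return False
    return True


people = [
    {"name": "Alice", "age": 30, "department": "Engineering"},
    {"name": "Bob", "age": 25, "department": "Engineering"},
    {"name": "Charlie", "age": 35, "department": "Sales"},
    {"name": "Diana", "age": 28, "department": "Engineering"},
]

query = {"department": "Engineering", "age": {"$gte": 28, "$lt": 35}}
selected = [p["name"] for p in people if matches(p, query)]
print(selected)
```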

---
## 3. Embedding vs Referencing Patterns

A critical design decision in document databases is choosing between **embedding** related data or **referencing** it.

### Embedding (Denormalization)

Store related data **within** the same document.

```json
{
  "_id": "order_001",
  "customer": {
    "name": "John Doe",
    "email": "john@example.com"
  },
  "items": [
    { "product": "Laptop", "price": 1200, "qty": 1 }
  ]
}
```

**Pros:**
- Single query retrieves all data
- Atomic updates within document
- Better read performance

**Cons:**
- Document size limits (16MB in MongoDB)
- Data duplication
- Harder to update shared data
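
The last two drawbacks are easy to see in a small sketch. When customer details are copied into every order, changing one email means touching every order that holds a copy. The `id` field inside the embedded customer and the `update_embedded_email` helper are assumptions added for illustration.

```python
orders = [
    {"_id": "order_001", "customer": {"id": "user_001", "email": "john@example.com"}},
    {"_id": "order_002", "customer": {"id": "user_001", "email": "john@example.com"}},
    {"_id": "order_003", "customer": {"id": "user_002", "email": "jane@example.com"}},
]


def update_embedded_email(orders: list[dict], customer_id: str, new_email: str) -> int:
    """Rewrite every embedded copy of the customer's email; return documents touched."""
    touched = 0
    for order in orders:
        if order["customer"]["id"] == customer_id:
            order["customer"]["email"] = new_email
            touched += 1
    return touched


touched = update_embedded_email(orders, "user_001", "john.doe@example.com")
print(f"Documents rewritten for one email change: {touched}")
```

With a referenced design the same change would be a single update to one customer document.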

---

### Referencing (Normalization)

Store related data in **separate documents** with references.

```json
// Order document
{
  "_id": "order_001",
  "customer_id": "user_001",
  "item_ids": ["item_001", "item_002"]
}

// User document
{
  "_id": "user_001",
  "name": "John Doe",
  "email": "john@example.com"
}
```

**Pros:**
- No data duplication
- Smaller documents
- Flexible relationships

**Cons:**
- Multiple queries required
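- No implicit joins: related data is combined in the application or with an explicit stage such as MongoDB's `$lookup`
- Updates that span several documents are not atomic unless multi-document transactions are used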

In [None]:
# Pattern 1: Embedded Documents
print("=" * 50)
print("EMBEDDED PATTERN")
print("=" * 50)

# Order with embedded customer and items
embedded_order = {
    "_id": "order_001",
    "order_date": "2026-02-01",
    "status": "shipped",
    "customer": {
        "name": "John Doe",
        "email": "john@example.com",
        "shipping_address": {
            "street": "123 Main St",
            "city": "New York",
            "zip": "10001"
        }
    },
    "items": [
        {"product": "Laptop", "sku": "LAP-001", "price": 1200, "qty": 1},
        {"product": "Mouse", "sku": "MOU-003", "price": 25, "qty": 2}
    ],
    "total": 1250
}

db.insert_one("orders_embedded", embedded_order)

# Single query gets everything!
order = db.find_one("orders_embedded", {"_id": "order_001"})
print(f"\nOrder for: {order['customer']['name']}")
print(f"Ship to: {order['customer']['shipping_address']['city']}")
print(f"Items: {len(order['items'])}")
print(f"Total: ${order['total']}")

In [None]:
# Pattern 2: Referenced Documents
print("=" * 50)
print("REFERENCED PATTERN")
print("=" * 50)

# Separate collections with references
customer = {
    "_id": "user_001",
    "name": "Jane Smith",
    "email": "jane@example.com",
    "addresses": [
        {"type": "home", "city": "Boston", "zip": "02101"},
        {"type": "work", "city": "Cambridge", "zip": "02139"}
    ]
}

products = [
    {"_id": "prod_001", "name": "Keyboard", "price": 150, "stock": 50},
    {"_id": "prod_002", "name": "Monitor", "price": 400, "stock": 20}
]

referenced_order = {
    "_id": "order_002",
    "customer_id": "user_001",  # Reference
    "items": [
        {"product_id": "prod_001", "qty": 2},  # References
        {"product_id": "prod_002", "qty": 1}
    ],
    "shipping_address_type": "work"
}

db.insert_one("customers", customer)
db.insert_many("products", products)
db.insert_one("orders_referenced", referenced_order)

# Application-level join (multiple queries)
def get_order_with_details(order_id: str) -> dict | None:
    """Simulate $lookup by joining data at application level."""
    order = db.find_one("orders_referenced", {"_id": order_id})
    if not order:
        return None
    
    # Fetch customer
    customer = db.find_one("customers", {"_id": order["customer_id"]})
    
    # Fetch products and calculate total
    total = 0
    enriched_items = []
    for item in order["items"]:
        product = db.find_one("products", {"_id": item["product_id"]})
        line_total = product["price"] * item["qty"]
        enriched_items.append({
            "product": product["name"],
            "price": product["price"],
            "qty": item["qty"],
            "subtotal": line_total
        })
        total += line_total
    
    return {
        "order_id": order["_id"],
        "customer_name": customer["name"],
        "items": enriched_items,
        "total": total
    }

result = get_order_with_details("order_002")
print(f"\nOrder: {result['order_id']}")
print(f"Customer: {result['customer_name']}")
for item in result['items']:
    print(f"  - {item['product']} x{item['qty']} = ${item['subtotal']}")
print(f"Total: ${result['total']}")
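
The join above issues one lookup per order item. A possible refinement, sketched here on standalone data, is to collect all referenced ids first, fetch them in one pass, and handle references that point at nothing. Against MongoDB the batched fetch would typically be a single query using `$in`. The `enrich_items` helper and the dangling `prod_999` reference are hypothetical.

```python
products = [
    {"_id": "prod_001", "name": "Keyboard", "price": 150},
    {"_id": "prod_002", "name": "Monitor", "price": 400},
]

order = {
    "_id": "order_002",
    "items": [
        {"product_id": "prod_001", "qty": 2},
        {"product_id": "prod_002", "qty": 1},
        {"product_id": "prod_999", "qty": 1},  # dangling reference (assumed for the demo)
    ],
}


def enrich_items(order: dict, products: list[dict]) -> tuple[list[dict], list[str]]:
    """Resolve product references in one pass; report ids that could not be found."""
    wanted = {item["product_id"] for item in order["items"]}
    # Single scan over the products collection instead of one lookup per item.
    by_id = {p["_id"]: p for p in products if p["_id"] in wanted}

    enriched, missing = [], []
    for item in order["items"]:
        product = by_id.get(item["product_id"])
        if product is None:
            missing.append(item["product_id"])
            continue
        enriched.append({
            "product": product["name"],
            "qty": item["qty"],
            "subtotal": product["price"] * item["qty"],
        })
    return enriched, missing


enriched, missing = enrich_items(order, products)
total = sum(line["subtotal"] for line in enriched)
print(f"Total: ${total}, unresolved references: {missing}")
```

Nothing in a document store enforces referential integrity by default, so the application has to decide what a missing reference means.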

### Decision Matrix: When to Embed vs Reference

| Factor | Embed | Reference |
|--------|-------|----------|
| **Relationship** | 1:1, 1:few | 1:many, many:many |
| **Read pattern** | Data accessed together | Data accessed separately |
| **Update pattern** | Child data rarely changes | Child data changes frequently |
| **Data size** | Child data is small | Child data is large or unbounded |
| **Atomicity** | Need atomic operations | Can tolerate eventual consistency |

---
## 4. Aggregation Pipeline Concepts

The aggregation pipeline is a powerful framework for data transformation and analysis. Data flows through **stages**, each transforming the documents.

### Common Pipeline Stages

| Stage | Purpose |
|-------|--------|
| `$match` | Filter documents (like WHERE) |
| `$group` | Group by field and aggregate (like GROUP BY) |
| `$project` | Reshape documents (like SELECT) |
| `$sort` | Order documents |
| `$limit` / `$skip` | Pagination |
| `$unwind` | Deconstruct arrays |
| `$lookup` | Left outer join |
| `$addFields` | Add computed fields |
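
Of these stages, `$lookup` is the one whose output shape tends to surprise: every input document is kept and gains an array of matches, which is empty when nothing matches. A minimal in-memory sketch of that left outer join follows. The `lookup` function and its parameter names are assumptions that mirror the stage's options, and join keys are assumed to be hashable.

```python
from copy import deepcopy


def lookup(local_docs: list[dict], foreign_docs: list[dict],
           local_field: str, foreign_field: str, as_field: str) -> list[dict]:
    """Left outer join: attach all matching foreign documents as an array."""
    # Index the foreign collection by join key (assumes hashable key values).
    index: dict = {}
    for foreign in foreign_docs:
        index.setdefault(foreign.get(foreign_field), []).append(foreign)

    joined = []
    for doc in local_docs:
        out = deepcopy(doc)
        # Unmatched documents are kept with an empty array, as in a left outer join.
        out[as_field] = deepcopy(index.get(doc.get(local_field), []))
        joined.append(out)
    return joined


orders = [
    {"_id": "order_001", "customer_id": "user_001"},
    {"_id": "order_002", "customer_id": "user_404"},
]
customers = [{"_id": "user_001", "name": "Jane Smith"}]

joined = lookup(orders, customers, "customer_id", "_id", "customer")
for doc in joined:
    print(doc["_id"], doc["customer"])
```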

In [None]:
class AggregationPipeline:
    """Simulated MongoDB-style aggregation pipeline."""
    
    def __init__(self, documents: list[dict]):
        self.documents = deepcopy(documents)
    
    def match(self, criteria: dict) -> 'AggregationPipeline':
        """$match - Filter documents."""
        self.documents = [
            doc for doc in self.documents
            if all(doc.get(k) == v for k, v in criteria.items())
        ]
        return self
    
    def project(self, fields: dict) -> 'AggregationPipeline':
        """$project - Select/reshape fields."""
        result = []
        for doc in self.documents:
            new_doc = {}
            for field, include in fields.items():
                if include and field in doc:
                    new_doc[field] = doc[field]
            result.append(new_doc)
        self.documents = result
        return self
    
    def group(self, _id: str, **accumulators) -> 'AggregationPipeline':
        """$group - Group and aggregate."""
        groups = {}
        for doc in self.documents:
            key = doc.get(_id) if _id else None
            if key not in groups:
                groups[key] = {"_id": key, "_docs": []}
            groups[key]["_docs"].append(doc)
        
        result = []
        for key, group in groups.items():
            doc = {"_id": key}
            for acc_name, (op, field) in accumulators.items():
                values = [d.get(field, 0) for d in group["_docs"]]
                if op == "$sum":
                    doc[acc_name] = sum(values)
                elif op == "$avg":
                    doc[acc_name] = sum(values) / len(values) if values else 0
                elif op == "$count":
                    doc[acc_name] = len(group["_docs"])
                elif op == "$max":
                    doc[acc_name] = max(values) if values else None
                elif op == "$min":
                    doc[acc_name] = min(values) if values else None
            result.append(doc)
        self.documents = result
        return self
    
    def sort(self, field: str, ascending: bool = True) -> 'AggregationPipeline':
        """$sort - Order documents."""
        self.documents.sort(key=lambda x: x.get(field, 0), reverse=not ascending)
        return self
    
    def limit(self, n: int) -> 'AggregationPipeline':
        """$limit - Return first n documents."""
        self.documents = self.documents[:n]
        return self
    
    def unwind(self, array_field: str) -> 'AggregationPipeline':
        """$unwind - Deconstruct array field."""
        result = []
        for doc in self.documents:
            arr = doc.get(array_field, [])
            if isinstance(arr, list):
                for item in arr:
                    new_doc = deepcopy(doc)
                    new_doc[array_field] = item
                    result.append(new_doc)
            else:
                result.append(doc)
        self.documents = result
        return self
    
    def result(self) -> list[dict]:
        """Return aggregation result."""
        return self.documents

print("AggregationPipeline class defined!")

In [None]:
# Sample sales data for aggregation
sales = [
    {"product": "Laptop", "category": "Electronics", "price": 1200, "qty": 2, "region": "North"},
    {"product": "Mouse", "category": "Electronics", "price": 25, "qty": 10, "region": "South"},
    {"product": "Desk", "category": "Furniture", "price": 350, "qty": 3, "region": "North"},
    {"product": "Chair", "category": "Furniture", "price": 200, "qty": 5, "region": "East"},
    {"product": "Keyboard", "category": "Electronics", "price": 150, "qty": 8, "region": "North"},
    {"product": "Monitor", "category": "Electronics", "price": 400, "qty": 4, "region": "South"},
    {"product": "Bookshelf", "category": "Furniture", "price": 180, "qty": 2, "region": "East"},
]

# Add computed revenue field
for sale in sales:
    sale["revenue"] = sale["price"] * sale["qty"]

print("Sales Data:")
for s in sales:
    print(f"  {s['product']:12} | {s['category']:12} | ${s['revenue']:>6}")

In [None]:
# Aggregation Example 1: Total revenue by category
print("=" * 50)
print("Revenue by Category (sorted descending)")
print("=" * 50)

result = (
    AggregationPipeline(sales)
    .group(
        "category",
        total_revenue=("$sum", "revenue"),
        count=("$count", "_id")
    )
    .sort("total_revenue", ascending=False)
    .result()
)

for doc in result:
    print(f"  {doc['_id']:15} Revenue: ${doc['total_revenue']:>6}  Items: {doc['count']}")

In [None]:
# Aggregation Example 2: Electronics products with revenue > $1000
print("=" * 50)
print("Electronics with Revenue > $1000")
print("=" * 50)

result = (
    AggregationPipeline(sales)
    .match({"category": "Electronics"})
    .project({"product": True, "revenue": True})
    .result()
)

# Filter for revenue > 1000 (simulating $match after $project)
high_revenue = [r for r in result if r["revenue"] > 1000]

for doc in high_revenue:
    print(f"  {doc['product']:15} ${doc['revenue']}")

In [None]:
# Aggregation Example 3: Unwind array and count skills
print("=" * 50)
print("$unwind Example: Count skill occurrences")
print("=" * 50)

employees = [
    {"name": "Alice", "skills": ["Python", "MongoDB", "Docker"]},
    {"name": "Bob", "skills": ["JavaScript", "MongoDB", "React"]},
    {"name": "Charlie", "skills": ["Python", "PostgreSQL", "Docker"]},
]

result = (
    AggregationPipeline(employees)
    .unwind("skills")
    .group(
        "skills",
        count=("$count", "_id")
    )
    .sort("count", ascending=False)
    .result()
)

print("\nSkill Frequency:")
for doc in result:
    bar = "█" * doc["count"]
    print(f"  {doc['_id']:12} {bar} ({doc['count']})")
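
For comparison, the skill-count example above could be written as stage documents, the form MongoDB pipelines take. This is a sketch of the data structure only: nothing here talks to a server, and with PyMongo such a list would be passed to `collection.aggregate(pipeline)`. The `Counter` cross-check is added purely to show what the pipeline is meant to compute.

```python
from collections import Counter

employees = [
    {"name": "Alice", "skills": ["Python", "MongoDB", "Docker"]},
    {"name": "Bob", "skills": ["JavaScript", "MongoDB", "React"]},
    {"name": "Charlie", "skills": ["Python", "PostgreSQL", "Docker"]},
]

# Each stage is a one-key document; field references are prefixed with "$".
pipeline = [
    {"$unwind": "$skills"},                                # one document per skill
    {"$group": {"_id": "$skills", "count": {"$sum": 1}}},  # count documents per skill
    {"$sort": {"count": -1}},                              # most frequent first
]

# Plain-Python cross-check of the intended result.
skill_counts = Counter(skill for e in employees for skill in e["skills"])

for stage in pipeline:
    print(stage)
print(dict(skill_counts))
```

Adding 1 per document with `{"$sum": 1}` is a common way to count the members of each group.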

---
## 5. MongoDB vs CouchDB Comparison

| Feature | MongoDB | CouchDB |
|---------|---------|--------|
| **Query Language** | MQL (MongoDB Query Language) | Mango queries, MapReduce |
| **Replication** | Replica Sets | Multi-Master |
| **Conflict Resolution** | Single primary serializes writes (last write wins) | MVCC, revision history |
| **Protocol** | Binary (wire protocol) | HTTP/REST |
| **Offline Support** | Limited | Built-in (PouchDB sync) |
| **Indexing** | B-tree, Text, Geospatial | B-tree views |
| **Transactions** | Multi-document ACID (4.0+) | Document-level only |
| **Use Case** | General purpose, real-time | Offline-first, sync-heavy |

In [None]:
# CouchDB-style revision tracking simulation
class CouchDBDocument:
    """Simulates CouchDB's MVCC (Multi-Version Concurrency Control)."""
    
    def __init__(self, doc_id: str, data: dict):
        self._id = doc_id
        self._rev = self._generate_rev(1)
        self._data = deepcopy(data)
        self._history = [{
            "_rev": self._rev,
            "data": deepcopy(data)
        }]
    
    def _generate_rev(self, num: int) -> str:
        """Generate CouchDB-style revision ID."""
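        import hashlib
        # Simplification: real CouchDB derives the hash from the document content,
        # whereas this demo hashes the current timestamp.
        hash_part = hashlib.md5(str(datetime.now()).encode()).hexdigest()[:8]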
        return f"{num}-{hash_part}"
    
    def update(self, rev: str, new_data: dict) -> str:
        """Update document (requires current revision)."""
        if rev != self._rev:
            raise Exception(f"Conflict! Expected rev {self._rev}, got {rev}")
        
        rev_num = int(self._rev.split("-")[0]) + 1
        self._rev = self._generate_rev(rev_num)
        self._data = deepcopy(new_data)
        self._history.append({
            "_rev": self._rev,
            "data": deepcopy(new_data)
        })
        return self._rev
    
    def get(self) -> dict:
        """Get current document state."""
        return {
            "_id": self._id,
            "_rev": self._rev,
            **self._data
        }
    
    def get_revision_history(self) -> list:
        """Get all revisions."""
        return self._history

# Demo CouchDB-style updates
print("CouchDB MVCC Demo")
print("=" * 50)

doc = CouchDBDocument("user_001", {"name": "Alice", "status": "active"})
print(f"\nInitial: {doc.get()}")

# Update with correct revision
current_rev = doc.get()["_rev"]
new_rev = doc.update(current_rev, {"name": "Alice", "status": "premium"})
print(f"After update: {doc.get()}")

# Try update with stale revision (conflict!)
try:
    doc.update(current_rev, {"name": "Alice", "status": "inactive"})  # Using old rev
except Exception as e:
    print(f"\n⚠️  {e}")

print(f"\nRevision History ({len(doc.get_revision_history())} versions):")
for rev in doc.get_revision_history():
    print(f"  {rev['_rev']}: {rev['data']}")
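
The demo above rejects a stale revision on a single node. With multi-master replication, two nodes can each accept a different update. CouchDB then keeps both revisions and picks the same winner on every node without coordination. The sketch below is a rough rendering of that rule as the CouchDB documentation describes it: the highest revision number wins, and ties are broken by comparing the revision strings. Deleted revisions are ignored here, and `pick_winner` is a hypothetical name.

```python
def pick_winner(revisions: list[str]) -> str:
    """Choose a deterministic winner among conflicting revision ids like '2-ab12'."""
    def sort_key(rev: str) -> tuple[int, str]:
        number, _, hash_part = rev.partition("-")
        # Compare the numeric prefix as an integer so that 10 beats 9.
        return int(number), hash_part

    return max(revisions, key=sort_key)


# Two replicas updated the same document independently from revision 1.
conflicting = ["2-7c1a", "2-e04f"]
winner = pick_winner(conflicting)
losers = [rev for rev in conflicting if rev != winner]

print(f"Winner: {winner}")
print(f"Kept as conflicts for the application to resolve: {losers}")
```

The losing revisions are not discarded, so the application can still merge or delete them explicitly.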

---
## 6. Best Practices & Patterns

### Schema Design Patterns

| Pattern | Description | Use Case |
|---------|-------------|----------|
| **Attribute** | Store dynamic attributes as array of key-value pairs | Product variants, custom fields |
| **Bucket** | Group time-series data into fixed-size buckets | IoT, logs, metrics |
| **Computed** | Pre-compute and store derived values | Analytics, dashboards |
| **Extended Reference** | Embed frequently-accessed fields from referenced docs | Reduce lookups |
| **Outlier** | Handle documents that exceed normal size | Social media (viral posts) |
| **Polymorphic** | Store different entity types in same collection | Content management |
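
Two of these patterns combine naturally when an order is written. The sketch below copies only the customer fields needed for display (Extended Reference) and stores the total at write time (Computed). The `build_order` helper and the choice of which fields to copy are assumptions for illustration.

```python
customer = {
    "_id": "user_001",
    "name": "Jane Smith",
    "email": "jane@example.com",
    "addresses": [{"type": "home", "city": "Boston", "zip": "02101"}],
}

items = [
    {"product_id": "prod_001", "name": "Keyboard", "price": 150, "qty": 2},
    {"product_id": "prod_002", "name": "Monitor", "price": 400, "qty": 1},
]


def build_order(order_id: str, customer: dict, items: list[dict]) -> dict:
    """Assemble an order document using the Extended Reference and Computed patterns."""
    return {
        "_id": order_id,
        # Extended Reference: keep the id for lookups, plus the one field
        # that order listings display, so most reads need no second query.
        "customer": {"_id": customer["_id"], "name": customer["name"]},
        "items": items,
        # Computed: derive the total once at write time instead of on every read.
        "total": sum(item["price"] * item["qty"] for item in items),
    }


order = build_order("order_003", customer, items)
print(order["customer"], order["total"])
```

The copied name can go stale if the customer renames themselves, which is the usual cost of this pattern.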

In [None]:
# Pattern: Attribute Pattern
print("Attribute Pattern Example")
print("=" * 50)

# Instead of fixed schema with many optional fields:
# {"color": "red", "size": "L", "material": "cotton", "weight": null, ...}

# Use attribute pattern:
product_with_attributes = {
    "_id": "prod_123",
    "name": "T-Shirt",
    "price": 29.99,
    "attributes": [
        {"k": "color", "v": "red"},
        {"k": "size", "v": "L"},
        {"k": "material", "v": "cotton"},
        {"k": "brand", "v": "Acme"}
    ]
}

# Easy to query. Use $elemMatch so both conditions apply to the same array element:
# db.products.find({"attributes": {"$elemMatch": {"k": "color", "v": "red"}}})
# Easy to add new attributes without schema changes

print(json.dumps(product_with_attributes, indent=2))
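
One subtlety when querying key-value arrays is that both conditions must hold for the same array element, which is what MongoDB's `$elemMatch` expresses. The sketch below imitates both behaviours in plain Python. The function names and the `trim` attribute are hypothetical.

```python
def any_element_query(doc: dict, key: str, value: str) -> bool:
    """Imitates {"attributes.k": key, "attributes.v": value}:
    each condition may be satisfied by a different array element."""
    attrs = doc.get("attributes", [])
    return any(a.get("k") == key for a in attrs) and any(a.get("v") == value for a in attrs)


def elem_match_query(doc: dict, key: str, value: str) -> bool:
    """Imitates $elemMatch: one element must satisfy both conditions."""
    attrs = doc.get("attributes", [])
    return any(a.get("k") == key and a.get("v") == value for a in attrs)


# A blue shirt with red trim: no attribute is (color, red).
product = {
    "_id": "prod_456",
    "attributes": [
        {"k": "color", "v": "blue"},
        {"k": "trim", "v": "red"},
    ],
}

loose = any_element_query(product, "color", "red")
strict = elem_match_query(product, "color", "red")
print(f"Separate dotted conditions match: {loose}")
print(f"$elemMatch-style condition matches: {strict}")
```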

In [None]:
# Pattern: Bucket Pattern (for time-series data)
print("\nBucket Pattern Example")
print("=" * 50)

# Instead of one document per reading:
# {"sensor_id": "s1", "timestamp": "...", "value": 23.5}

# Bucket by hour:
sensor_bucket = {
    "_id": "sensor_001_2026020115",  # sensor_id + YYYYMMDDHH
    "sensor_id": "sensor_001",
    "bucket_start": "2026-02-01T15:00:00Z",
    "bucket_end": "2026-02-01T15:59:59Z",
    "count": 4,
    "sum": 94.2,  # Pre-computed for fast aggregation
    "min": 22.5,
    "max": 24.1,
    "readings": [
        {"t": "2026-02-01T15:00:00Z", "v": 23.5},
        {"t": "2026-02-01T15:15:00Z", "v": 24.1},
        {"t": "2026-02-01T15:30:00Z", "v": 22.5},
        {"t": "2026-02-01T15:45:00Z", "v": 24.1},
    ]
}

# Benefits:
# - Fewer documents (1 per hour instead of 1 per reading; 4 readings per hour in this example)
# - Pre-computed aggregates
# - Efficient time-range queries

print(json.dumps(sensor_bucket, indent=2))
print(f"\nAverage reading: {sensor_bucket['sum'] / sensor_bucket['count']:.2f}")
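
Keeping the pre-computed fields correct is the main obligation of the Bucket Pattern: every new reading has to update the array and the aggregates together. A minimal in-memory sketch follows, with `new_bucket` and `add_reading` as hypothetical helpers. In MongoDB the same effect is usually achieved with one upsert that combines `$push`, `$inc`, `$min` and `$max`.

```python
def new_bucket(sensor_id: str, hour_key: str) -> dict:
    """Create an empty bucket; hour_key is assumed to look like YYYYMMDDHH."""
    return {
        "_id": f"{sensor_id}_{hour_key}",
        "sensor_id": sensor_id,
        "count": 0,
        "sum": 0.0,
        "min": None,
        "max": None,
        "readings": [],
    }


def add_reading(bucket: dict, timestamp: str, value: float) -> None:
    """Append a reading and keep the pre-computed aggregates in step with it."""
    bucket["readings"].append({"t": timestamp, "v": value})
    bucket["count"] += 1
    bucket["sum"] += value
    bucket["min"] = value if bucket["min"] is None else min(bucket["min"], value)
    bucket["max"] = value if bucket["max"] is None else max(bucket["max"], value)


bucket = new_bucket("sensor_001", "2026020115")
add_reading(bucket, "2026-02-01T15:00:00Z", 23.5)
add_reading(bucket, "2026-02-01T15:15:00Z", 24.1)
add_reading(bucket, "2026-02-01T15:30:00Z", 22.5)

average = bucket["sum"] / bucket["count"]
print(f"{bucket['count']} readings, min {bucket['min']}, max {bucket['max']}, avg {average:.2f}")
```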

---
## 🎯 Key Takeaways

### Document Model Fundamentals
- Documents are **self-contained, hierarchical data structures** (JSON/BSON)
- **Schema flexibility** allows evolution without migrations
- BSON extends JSON with types like `ObjectId`, `Date`, `Decimal128`

### Embedding vs Referencing
- **Embed** when: data is accessed together, 1:1 or 1:few relationships, child data is small
- **Reference** when: data is accessed independently, many:many relationships, child data changes frequently
- Trade-off: **Read performance (embed)** vs **Write flexibility (reference)**

### Aggregation Pipeline
- Chain of **stages** that transform documents: `$match` → `$group` → `$project` → `$sort`
- **$unwind** expands arrays for per-element analysis
- **$lookup** enables joins (use sparingly for performance)

### MongoDB vs CouchDB
- **MongoDB**: General-purpose, strong consistency, rich queries
- **CouchDB**: Offline-first, MVCC conflict resolution, HTTP/REST API

### Design Patterns
- **Attribute Pattern**: Dynamic key-value pairs for flexible schemas
- **Bucket Pattern**: Group time-series data for efficiency
- **Extended Reference**: Denormalize frequently-accessed fields

### When to Choose Document Stores
✅ Rapid development with evolving schemas  
✅ Hierarchical data that maps to application objects  
✅ Read-heavy workloads with embedded data  
❌ Complex multi-table transactions  
❌ Highly normalized relational data