# Data Storage & Modeling Overview

Essential concepts for backend data architecture: database paradigms, consistency models, and practical patterns.

## SQL vs NoSQL

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATABASE PARADIGMS                                │
├─────────────────────────────────┬───────────────────────────────────────────┤
│            SQL                  │              NoSQL                        │
├─────────────────────────────────┼───────────────────────────────────────────┤
│  ┌─────────────────────────┐    │    ┌─────────────────────────────────┐    │
│  │ Structured Schema       │    │    │ Flexible/Schema-less            │    │
│  │ Tables, Rows, Columns   │    │    │ Documents, Key-Value, Graphs    │    │
│  │ ACID Transactions       │    │    │ BASE (Eventually Consistent)    │    │
│  │ Vertical Scaling        │    │    │ Horizontal Scaling              │    │
│  │ Complex Joins           │    │    │ Denormalized Data               │    │
│  └─────────────────────────┘    │    └─────────────────────────────────┘    │
├─────────────────────────────────┼───────────────────────────────────────────┤
│ USE WHEN:                       │ USE WHEN:                                 │
│ • Complex relationships         │ • Rapid iteration/schema changes          │
│ • ACID compliance required      │ • Massive scale/high throughput           │
│ • Data integrity is critical    │ • Unstructured/semi-structured data       │
│ • Reporting/analytics           │ • Geographic distribution                 │
└─────────────────────────────────┴───────────────────────────────────────────┘
```

## Database Types Comparison

| Type | Examples | Data Model | Best For | Limitations |
|------|----------|------------|----------|-------------|
| **Relational** | PostgreSQL, MySQL | Tables with rows/columns | OLTP, complex queries, transactions | Scaling writes, schema rigidity |
| **Document** | MongoDB, CouchDB | JSON-like documents | Content management, catalogs, user profiles | Complex transactions, joins |
| **Key-Value** | Redis, DynamoDB | Simple key→value pairs | Caching, sessions, real-time data | No complex queries |
| **Wide-Column** | Cassandra, HBase | Column families | Time-series, IoT, write-heavy workloads | Ad-hoc queries |
| **Graph** | Neo4j, Neptune | Nodes + edges | Social networks, recommendations, fraud detection | Not for bulk analytics |
| **Time-Series** | InfluxDB, TimescaleDB | Timestamped data points | Metrics, monitoring, financial data | General-purpose queries |
| **Vector** | Pinecone, Milvus | High-dimensional vectors | AI/ML embeddings, similarity search | Traditional queries |

## ACID vs BASE

```
┌──────────────────────────────────┐     ┌──────────────────────────────────┐
│             ACID                 │     │             BASE                 │
│      (Strong Consistency)        │     │     (Eventual Consistency)       │
├──────────────────────────────────┤     ├──────────────────────────────────┤
│                                  │     │                                  │
│  A - Atomicity                   │     │  BA - Basically Available        │
│      All or nothing              │     │       System always responds     │
│                                  │     │                                  │
│  C - Consistency                 │     │  S  - Soft State                 │
│      Valid state transitions     │     │       State may change over time │
│                                  │     │                                  │
│  I - Isolation                   │     │  E  - Eventual Consistency       │
│      Concurrent independence     │     │       Will converge eventually   │
│                                  │     │                                  │
│  D - Durability                  │     │                                  │
│      Committed = Permanent       │     │                                  │
│                                  │     │                                  │
├──────────────────────────────────┤     ├──────────────────────────────────┤
│  ✓ Banking, Inventory            │     │  ✓ Social feeds, Analytics       │
│  ✓ Order processing              │     │  ✓ Caching, Recommendations      │
│  ✗ Lower availability            │     │  ✗ Stale reads possible          │
└──────────────────────────────────┘     └──────────────────────────────────┘
```
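
As a rough illustration of atomicity ("all or nothing"), the sketch below uses Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the `CHECK` constraint are assumptions made up for this example, not part of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Hypothetical helper: move `amount` from src to dst in one transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # Violates the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so the credit is undone too
```

The second transfer credits `bob` first and only then fails on the debit, so the rollback is what keeps the credit from persisting.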

## CAP Theorem

**In a distributed system, a network partition forces a choice between consistency and availability, so only 2 of the 3 properties can be guaranteed at once:**

```
                         Consistency (C)
                         All nodes see same data
                              /\
                             /  \
                            /    \
                           /  CP  \
                          /        \
                         /    ┌─────┐\
                        /     │TRADE│ \
                       /      │ OFF │  \
                      /       └─────┘   \
                     /    CA      AP     \
                    /________________________\
     Availability (A)                    Partition Tolerance (P)
     Every request                       System works despite
     gets a response                     network failures

┌─────────────────────────────────────────────────────────────────┐
│  CP Systems: MongoDB, HBase, ZooKeeper                          │
│  → Sacrifice availability for consistency during partitions     │
│                                                                 │
│  AP Systems: Cassandra, DynamoDB, CouchDB                       │
│  → Sacrifice consistency for availability during partitions     │
│                                                                 │
│  CA Systems: Single-node RDBMS (PostgreSQL, MySQL)              │
│  → No partition tolerance (not truly distributed)               │
└─────────────────────────────────────────────────────────────────┘
```

> **Reality:** Most systems are tunable - they exist on a spectrum, not fixed categories.

## Indexing Strategies

```
┌────────────────────────────────────────────────────────────────────────────┐
│                           INDEX TYPES                                      │
├───────────────────┬────────────────────────────────────────────────────────┤
│ B-Tree (Default)  │ Balanced tree, O(log n) lookups                        │
│                   │ Best for: equality, range queries, ORDER BY            │
├───────────────────┼────────────────────────────────────────────────────────┤
│ Hash              │ O(1) lookups, no range support                         │
│                   │ Best for: exact equality only (e.g., user lookups)     │
├───────────────────┼────────────────────────────────────────────────────────┤
│ GIN (Inverted)    │ Multiple values per row                                │
│                   │ Best for: arrays, JSONB, full-text search              │
├───────────────────┼────────────────────────────────────────────────────────┤
│ GiST              │ Generalized search tree                                │
│                   │ Best for: geometric data, ranges, nearest-neighbor     │
├───────────────────┼────────────────────────────────────────────────────────┤
│ Composite         │ Multi-column index                                     │
│                   │ Column order matters! (leftmost prefix rule)           │
└───────────────────┴────────────────────────────────────────────────────────┘

INDEX GUIDELINES:
  ✓ Index columns in WHERE, JOIN, ORDER BY clauses
  ✓ Use covering indexes for read-heavy queries
  ✓ Consider partial indexes for filtered data
  ✗ Avoid over-indexing (slows writes)
  ✗ Don't index low-cardinality columns alone
```
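
The leftmost prefix rule for composite indexes can be observed with SQLite's `EXPLAIN QUERY PLAN`. This is a minimal sketch: the `orders` table and the `query_plan` helper are assumptions for illustration, and the exact plan wording varies between SQLite versions and between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, total REAL)"
)
# Composite index: user_id is the leftmost column.
conn.execute("CREATE INDEX idx_user_status ON orders (user_id, status)")


def query_plan(sql):
    """Hypothetical helper: return SQLite's plan description for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


# Filters on the leftmost column (alone or with status) can use the index.
plan_prefix = query_plan("SELECT * FROM orders WHERE user_id = 1")
plan_both = query_plan("SELECT * FROM orders WHERE user_id = 1 AND status = 'paid'")
# Filtering on status alone skips the leftmost column, so the index does not help.
plan_suffix = query_plan("SELECT * FROM orders WHERE status = 'paid'")

for label, plan in [("user_id", plan_prefix), ("both", plan_both), ("status", plan_suffix)]:
    print(f"{label:8} -> {plan}")
```

If status-only lookups matter, a separate index on `status` (or an index with the columns reversed) is the usual remedy, at the cost of extra write overhead.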

## Schema Evolution & Migrations

**Schema evolution** keeps systems running while data models change.
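- **Backward compatible**: new code can still read existing data. Prefer additive changes such as nullable columns or new tables.
- **Forward compatible**: older readers keep working on data written by newer code. Give new fields default values and have readers ignore fields they do not recognize.
- **Two-phase changes**: deploy code that handles both schemas, backfill, then remove the old fields.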

**Migration strategies:**
- **Expand–contract**: add new schema, backfill, switch, remove old.
- **Online migrations**: small batches to avoid locks/long transactions.
- **Versioned APIs**: keep old clients functional during transitions.

**Tip:** Treat schema changes as deployments with rollbacks and observability.
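
The expand and backfill steps of an expand–contract migration might look like the sketch below, again using `sqlite3`. The `users` table, the `display_name` column, and the `backfill` helper are hypothetical; a production migration would normally go through a migration tool and add throttling and monitoring between batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [(f"user{i}",) for i in range(10)])
conn.commit()

# Expand: add the new column as nullable so existing writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn, batch_size=3):
    """Hypothetical helper: copy name into display_name in small batches."""
    batches = 0
    while True:
        with conn:  # one short transaction per batch keeps locks brief
            cur = conn.execute(
                "UPDATE users SET display_name = name "
                "WHERE id IN (SELECT id FROM users WHERE display_name IS NULL LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return batches
        batches += 1


batches = backfill(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
# Switch and contract (not shown): point readers at display_name, then drop the
# old column once no deployed code depends on it.
```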

## Partitioning, Sharding, and Replication

**Partitioning** splits a table into logical pieces (by range, list, or hash).
- Improves query performance and maintenance (pruning, archiving).
- Common for time-series: partition by date.
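
To make date-based partitioning and pruning concrete, here is a small sketch of the routing logic. The monthly `events_YYYY_MM` naming scheme and both function names are assumptions; databases with native partitioning perform this routing and pruning internally.

```python
from datetime import date


def partition_for(day: date) -> str:
    """Hypothetical naming scheme: one partition per calendar month."""
    return f"events_{day.year:04d}_{day.month:02d}"


def partitions_in_range(start: date, end: date) -> list[str]:
    """Pruning: list only the monthly partitions a date-range query must touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"events_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


touched = partitions_in_range(date(2023, 11, 15), date(2024, 2, 3))
```

A query over that range reads four monthly partitions and can ignore every other month, and archiving old data becomes a matter of detaching or dropping whole partitions.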

**Sharding** distributes data across multiple nodes.
- **Key-based sharding**: consistent hashing or range keys.
- **Trade-off**: cross-shard joins are expensive.
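
A minimal consistent-hashing ring can be sketched with the standard library alone. The `HashRing` class, the shard names, and the choice of MD5 with 100 virtual nodes per shard are illustrative assumptions, not a recommendation for any specific system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical ring: each node owns many points (virtual nodes) on a circle."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first point clockwise from its hash (wrapping around).
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])

# Only keys claimed by the new shard move; everything else stays where it was.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding shard-d")
```

With plain `hash(key) % num_shards`, adding a shard would instead remap most keys, which is the main reason consistent hashing is preferred when the shard count changes.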

**Replication** copies data for availability and read scaling.
- **Leader–follower**: fast reads, possible replication lag.
- **Multi-leader**: higher availability, conflict resolution required.
- **Quorum reads/writes**: tune consistency vs availability (N, R, W).
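
For quorum systems with N replicas, a read quorum of R, and a write quorum of W, the usual rule of thumb is that R + W > N makes every read set overlap every write set. The sketch below checks that rule by brute force; both function names are made up for this example, and real systems add further caveats (sloppy quorums, concurrent writes) that it ignores.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Rule of thumb: read and write quorums always intersect when R + W > N."""
    return r + w > n


def overlap_by_enumeration(n: int, r: int, w: int) -> bool:
    """Check the same property by trying every possible read and write set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


strict = quorums_overlap(3, 2, 2)  # typical "majority" setting
loose = quorums_overlap(3, 1, 1)   # fast, but a read may miss the latest write
```

Lowering R or W improves latency and availability, at the price of reads that may land only on replicas the latest write has not reached.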

**Design tip:** Choose a shard key aligned with dominant access patterns (avoid hotspots).

## Basic Data Modeling

```
NORMALIZATION (Reduce Redundancy)         DENORMALIZATION (Optimize Reads)
─────────────────────────────────         ─────────────────────────────────
┌──────────┐    ┌──────────────┐          ┌────────────────────────────────┐
│  Users   │    │    Orders    │          │         OrdersFlattened        │
├──────────┤    ├──────────────┤          ├────────────────────────────────┤
│ id (PK)  │◄───│ user_id (FK) │   VS     │ order_id, user_name, email,    │
│ name     │    │ order_id     │          │ product_name, quantity, price, │
│ email    │    │ product_id   │          │ order_date, shipping_address   │
└──────────┘    └──────────────┘          └────────────────────────────────┘
     │               │                    
     │          ┌────┴─────┐              Fewer JOINs = Faster reads
     │          │ Products │              More storage = Higher redundancy
     │          └──────────┘              

COMMON PATTERNS:
┌────────────────────────────────────────────────────────────────────────────┐
│ 1:1  → Embed or separate table (user ↔ profile)                            │
│ 1:N  → Foreign key on "many" side (user → orders)                          │
│ M:N  → Junction/bridge table (students ↔ courses)                          │
│ Tree → Adjacency list, nested sets, or materialized path                   │
└────────────────────────────────────────────────────────────────────────────┘
```
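
The M:N pattern can be sketched with a junction table in `sqlite3`. The `students`, `courses`, and `enrollments` tables and the `courses_for` helper are assumptions chosen to match the students ↔ courses example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Junction table: one row per (student, course) pair.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(id),
        course_id  INTEGER NOT NULL REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO courses VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
    """
)


def courses_for(student_name):
    """Hypothetical helper: list course titles for one student via the junction table."""
    rows = conn.execute(
        "SELECT c.title FROM courses c "
        "JOIN enrollments e ON e.course_id = c.id "
        "JOIN students s ON s.id = e.student_id "
        "WHERE s.name = ? ORDER BY c.title",
        (student_name,),
    ).fetchall()
    return [title for (title,) in rows]
```

The composite primary key on `enrollments` prevents the same student from being enrolled in the same course twice.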

```python
# SQLAlchemy ORM Example - Basic Data Modeling
from datetime import datetime, timezone
from decimal import Decimal

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, DateTime, Numeric, Index
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


def utcnow():
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


# User model (1:N with Orders)
class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    email = Column(String(255), unique=True, nullable=False, index=True)  # Single-column unique index
    name = Column(String(100), nullable=False)
    created_at = Column(DateTime, default=utcnow)

    # lazy='dynamic' returns a query object, so .count() runs in the database
    orders = relationship('Order', back_populates='user', lazy='dynamic')


# Order model with composite index
class Order(Base):
    __tablename__ = 'orders'

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'), nullable=False)
    status = Column(String(20), default='pending')
    total = Column(Numeric(10, 2))
    created_at = Column(DateTime, default=utcnow)

    user = relationship('User', back_populates='orders')

    # Composite index for common query: "user's orders by status"
    __table_args__ = (
        Index('idx_user_status', 'user_id', 'status'),
    )


# Usage example
engine = create_engine('sqlite:///:memory:', echo=False)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

# Create user and order (Decimal avoids float rounding in the Numeric column)
user = User(email='dev@example.com', name='Dev User')
order = Order(user=user, status='completed', total=Decimal('99.99'))
session.add_all([user, order])
session.commit()

# Query with relationship
result = session.query(User).filter_by(email='dev@example.com').first()
print(f"User: {result.name}, Orders: {result.orders.count()}")
```

## Quick Reference

| Decision | Choose This | When |
|----------|-------------|------|
| SQL vs NoSQL | **SQL** | ACID needed, complex relations, structured data |
| SQL vs NoSQL | **NoSQL** | Scale-out, flexible schema, specific access patterns |
| Normalize | **Yes** | Write-heavy, data integrity critical |
| Denormalize | **Yes** | Read-heavy, known query patterns |
| Add Index | **Yes** | Column in WHERE/JOIN/ORDER, high cardinality |
| Skip Index | **Yes** | Small tables, low cardinality, write-heavy tables |

**Key Takeaway:** Start with normalized SQL unless you have a specific reason not to. Denormalize and add NoSQL components only when scaling demands it.