# Replication & Sharding — Overview

## Purpose
Understand the fundamental strategies for scaling databases horizontally: **replication** for high availability and read performance, and **sharding** for distributing data across multiple nodes to handle massive datasets.

## Key Questions
1. When should you replicate data vs. shard it?
2. What are the consistency trade-offs in replication?
3. How do you choose a good sharding key?
4. What happens when a shard becomes a hotspot?
5. How do replication and sharding work together in production systems?

---
## Why We Need Replication

**Replication** copies data across multiple database nodes. Each copy is called a **replica**.

### Key Benefits

| Benefit | Description |
|---------|-------------|
| **High Availability** | If one node fails, others continue serving requests |
| **Read Scaling** | Distribute read queries across replicas to reduce load on the primary |
| **Durability** | Data survives hardware failures when stored on multiple machines |
| **Geographic Distribution** | Place replicas closer to users for lower latency |
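
The read-scaling row can be illustrated with a minimal routing sketch. The `ReadWriteRouter` class and the node names are hypothetical, not part of any particular database driver. The sketch assumes a single primary and ignores replication lag, so a read served by a replica may return slightly stale data.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and rotate reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        # Fall back to the primary when no replicas are configured.
        self._read_cycle = itertools.cycle(self.replicas or [primary])

    def route(self, is_write):
        """Return the node name that should handle the request."""
        return self.primary if is_write else next(self._read_cycle)


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
reads = [router.route(is_write=False) for _ in range(4)]
writes = [router.route(is_write=True) for _ in range(2)]
```

Round-robin is only one possible policy. A real router might also weigh replica health, lag, or proximity to the client.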

### Replication Topologies

1. **Leader-Follower (Primary-Replica)**
   - One leader handles writes; followers replicate and serve reads
   - Simple, but the leader is a single point of failure for writes until a follower is promoted

2. **Leader-Leader (Multi-Primary)**
   - Multiple nodes accept writes
   - Requires conflict resolution

3. **Leaderless**
   - Any node can accept reads/writes (e.g., Cassandra, Riak, and other Dynamo-style stores)
   - Uses quorum reads and writes (typically W + R > N) so that read and write sets overlap
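
The quorum idea behind leaderless replication can be sketched with a little arithmetic. With `n` replicas, a write acknowledged by `w` nodes and a read that contacts `r` nodes must share at least one node whenever `w + r > n`. The function names below are illustrative, and the sketch ignores refinements such as sloppy quorums and hinted handoff.

```python
def quorums_overlap(n, w, r):
    """Return True when every read quorum intersects every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n


def tolerated_failures(n, w, r):
    """Nodes that can be down while both reads and writes still reach quorum."""
    return n - max(w, r)


# (n, w, r) configurations to compare
configs = [(3, 2, 2), (3, 1, 1), (5, 3, 3)]
results = {
    cfg: (quorums_overlap(*cfg), tolerated_failures(*cfg)) for cfg in configs
}
```

Lowering `w` and `r` tolerates more failures and reduces latency, but once `w + r <= n` a read may miss the most recent write.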

---
## Why We Need Sharding

**Sharding** (horizontal partitioning) splits data across multiple database instances. Each instance holds a subset of the data.

### Key Benefits

| Benefit | Description |
|---------|-------------|
| **Horizontal Scaling** | Add more shards to handle increased load |
| **Data Size** | No single machine needs to hold the entire dataset |
| **Write Scaling** | Distribute writes across multiple nodes |
| **Query Performance** | Smaller indexes per shard mean faster lookups for queries that target a single shard |

### Sharding Strategies

1. **Range-Based Sharding**
   - Partition by value ranges (e.g., A-M on shard 1, N-Z on shard 2)
   - Good for range queries, but skewed or monotonically increasing keys (e.g., timestamps) can concentrate load on one shard

2. **Hash-Based Sharding**
   - Hash the shard key to determine placement
   - Even distribution, but range queries require scatter-gather

3. **Directory-Based Sharding**
   - Lookup table maps keys to shards
   - Flexible, but every request pays a lookup and the directory itself must stay available
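
The three strategies above can be compared side by side as small routing functions. This is a sketch under stated assumptions: keys are non-empty strings, there are four shards, and the split points, function names, and directory contents are made up for illustration. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings is randomized per process by default.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count for this sketch


def range_shard(key, upper_bounds=("g", "n", "t")):
    """Range-based: shard i holds first letters below upper_bounds[i];
    the last shard takes everything else."""
    return bisect.bisect_right(upper_bounds, key[0].lower())


def hash_shard(key, num_shards=NUM_SHARDS):
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def directory_shard(key, directory, default=0):
    """Directory-based: an explicit lookup table decides placement."""
    return directory.get(key, default)


# Hypothetical directory that pins one large tenant to its own shard.
directory = {"big-tenant": 3}

range_placement = {k: range_shard(k) for k in ("alice", "mallory", "nina", "zoe")}
hash_placement = {k: hash_shard(k) for k in ("alice", "mallory", "nina", "zoe")}
pinned = directory_shard("big-tenant", directory)
```

Adjacent keys stay together under `range_shard`, which is what makes range scans cheap. `hash_shard` scatters them, which is why a range query has to visit every shard.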

---
## Replication vs. Sharding: Comparison

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create comparison visualization
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Replication: Same Data, Multiple Copies', 
                    'Sharding: Data Split Across Nodes'),
    horizontal_spacing=0.15
)

# Colors for data segments
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

# Replication visualization (left) - 3 replicas with same data
replica_labels = ['Primary', 'Replica 1', 'Replica 2']

for i, label in enumerate(replica_labels):
    # Each replica has ALL data (stacked bars)
    for j, color in enumerate(colors):
        fig.add_trace(
            go.Bar(
                x=[label],
                y=[25],
                name=f'Data Segment {j+1}' if i == 0 else None,
                marker_color=color,
                showlegend=(i == 0),
                legendgroup=f'seg{j}',
                hovertemplate=f'{label}<br>Data Segment {j+1}<extra></extra>'
            ),
            row=1, col=1
        )

# Sharding visualization (right) - data split across shards
shard_labels = ['Shard 1', 'Shard 2', 'Shard 3', 'Shard 4']
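# Each shard holds one 25-unit segment, on the same scale as the replicas above
shard_data = [
    [25, 0, 0, 0],   # Shard 1 holds only segment 1
    [0, 25, 0, 0],   # Shard 2 holds only segment 2
    [0, 0, 25, 0],   # Shard 3 holds only segment 3
    [0, 0, 0, 25],   # Shard 4 holds only segment 4
]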

for j, color in enumerate(colors):
    fig.add_trace(
        go.Bar(
            x=shard_labels,
            y=[shard_data[i][j] for i in range(4)],
            marker_color=color,
            showlegend=False,
            hovertemplate='%{x}<br>Data Segment ' + str(j+1) + '<extra></extra>'
        ),
        row=1, col=2
    )

fig.update_layout(
    title=dict(
        text='<b>Replication vs Sharding Concepts</b>',
        x=0.5,
        font=dict(size=18)
    ),
    barmode='stack',
    height=450,
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=-0.2,
        xanchor='center',
        x=0.5
    ),
    annotations=[
        dict(
            text='↑ Every node holds a complete copy of the data',
            x=0.18, y=-0.08, xref='paper', yref='paper',
            showarrow=False, font=dict(size=11, color='#555')
        ),
        dict(
            text='↑ Each node holds a unique subset of the data',
            x=0.82, y=-0.08, xref='paper', yref='paper',
            showarrow=False, font=dict(size=11, color='#555')
        )
    ]
)

fig.update_yaxes(title_text='Data Volume', row=1, col=1)
fig.update_yaxes(title_text='Data Volume', row=1, col=2)

fig

---
## When to Use Each Strategy

| Scenario | Replication | Sharding |
|----------|:-----------:|:--------:|
| Read-heavy workload | ✅ | ➖ |
| Write-heavy workload | ➖ | ✅ |
| High availability requirement | ✅ | ➖ |
| Data exceeds single node capacity | ➖ | ✅ |
| Low-latency global access | ✅ | ➖ |
| Need to scale writes linearly | ➖ | ✅ |

### Combined Approach (Production Reality)

Large production systems typically use **both**:
- **Shard** data across multiple partitions
- **Replicate** each shard for availability

Example: A 12-node cluster might have 4 shards × 3 replicas each.
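
The 12-node example can be laid out explicitly. The helper below is a hypothetical sketch that assigns each shard a dedicated group of three nodes and treats the first node in each group as the leader. Real deployments often spread a shard's replicas across racks or zones and may host replicas of several shards on one machine.

```python
def build_layout(nodes, num_shards, replication_factor):
    """Group nodes into shards, each with one leader and the rest as followers."""
    if len(nodes) != num_shards * replication_factor:
        raise ValueError("node count must equal num_shards * replication_factor")
    layout = {}
    for shard in range(num_shards):
        start = shard * replication_factor
        group = nodes[start:start + replication_factor]
        layout[shard] = {"leader": group[0], "followers": group[1:]}
    return layout


nodes = [f"node-{i:02d}" for i in range(12)]
layout = build_layout(nodes, num_shards=4, replication_factor=3)
```

A write for a given key first resolves to a shard, then goes to that shard's leader. Losing one node affects only a single replica of a single shard.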

---
## ✅ Takeaways

1. **Replication** creates copies of data for **availability**, **read scaling**, and **durability**

2. **Sharding** partitions data for **horizontal scaling** and handling **large datasets**

3. **Replication doesn't scale writes** — all replicas must process the same writes

4. **Sharding adds complexity** — cross-shard queries, rebalancing, and hotspot management

5. **Production systems combine both** — shard for scale, replicate each shard for resilience

6. **Choose your shard key carefully** — it determines data distribution and query efficiency