Implement Gateway StatefulSet with guild sharding #235

@vcarl

Context

Part of load balancer architecture work (see PR #228 for full analysis).

Risk: High | Reward: Very High | Code Changes: Extensive

This is the core architectural change that enables horizontal scaling.

Architecture Change

Deterministic sharding (instead of using a config service for assignments)

Gateway StatefulSet (3-10 pods)
├── gateway-0 → handles guilds where hash(guildId) % 3 == 0
├── gateway-1 → handles guilds where hash(guildId) % 3 == 1
└── gateway-2 → handles guilds where hash(guildId) % 3 == 2
    ↓
Each pod: SQLite → Litestream → DO Spaces

Implementation

Shard Calculation

// app/helpers/guildSharding.ts
import crypto from 'crypto';

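// Zero-based index of this pod within the StatefulSet.
const POD_ORDINAL = parseInt(process.env.POD_ORDINAL || '0', 10);
// Total number of gateway pods; must be the same on every pod.
const NUM_GATEWAY_PODS = parseInt(process.env.NUM_GATEWAY_PODS || '3', 10);

// Maps a guild ID to a shard index in [0, NUM_GATEWAY_PODS).
export function getShardForGuild(guildId: string): number {
  // MD5 is used only to spread IDs evenly, not for security.
  const hash = crypto.createHash('md5').update(guildId).digest();
  const num = hash.readUInt32BE(0); // first 4 bytes as an unsigned 32-bit integer
  return num % NUM_GATEWAY_PODS;
}

// True when this pod is responsible for the given guild.
export function isGuildMine(guildId: string): boolean {
  return getShardForGuild(guildId) === POD_ORDINAL;
}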
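To check that the hash spreads guilds reasonably evenly, it can help to have a variant that takes the pod count as an argument instead of reading it from the environment. The following is a minimal sketch under that assumption; `shardFor`, `shardCounts`, and the sample IDs are hypothetical and not part of the proposed helper.

```typescript
import { createHash } from "crypto";

// Same hashing as getShardForGuild, but with the pod count passed in
// (hypothetical signature) so it can be exercised without env vars.
function shardFor(guildId: string, numPods: number): number {
  const digest = createHash("md5").update(guildId).digest();
  return digest.readUInt32BE(0) % numPods;
}

// Counts how many of the given guilds land on each pod.
function shardCounts(guildIds: string[], numPods: number): number[] {
  const counts = new Array<number>(numPods).fill(0);
  for (const id of guildIds) {
    counts[shardFor(id, numPods)] += 1;
  }
  return counts;
}

// Made-up snowflake-like IDs, for illustration only.
const sampleIds = Array.from({ length: 3000 }, (_, i) => `10000000000000${10000 + i}`);

// Prints one count per pod; the three counts add up to 3000.
console.log(shardCounts(sampleIds, 3));
```

A test in staging could run the same count over the real guild list to confirm that no pod ends up with a disproportionate share.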

Guild Filtering

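// app/discord/gateway.ts
import { Events } from 'discord.js';
import { isGuildMine } from '../helpers/guildSharding';

// `client` is the existing discord.js Client instance.
client.on(Events.MessageCreate, async (msg) => {
  if (!msg.guildId) return; // Skip messages outside a guild (e.g. DMs)
  if (!isGuildMine(msg.guildId)) return; // Another pod owns this guild
  
  // Handle message
});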
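Since every event handler needs the same two checks, one option is to wrap handlers in a small higher-order function rather than repeating the guard. This is only a sketch: `onlyMyGuilds` is a hypothetical name, and it assumes the event object exposes a `guildId` field the way `msg` does above. Events that carry the guild elsewhere would need their own adapter.

```typescript
// Minimal shape the wrapper relies on (assumption: the event has guildId).
type GuildScoped = { guildId: string | null };

// Hypothetical helper: returns a handler that only runs for guilds
// the supplied predicate says belong to this pod.
function onlyMyGuilds<T extends GuildScoped>(
  isMine: (guildId: string) => boolean,
  handler: (event: T) => Promise<void> | void,
): (event: T) => Promise<void> {
  return async (event: T): Promise<void> => {
    if (!event.guildId) return; // no guild, nothing to shard on
    if (!isMine(event.guildId)) return; // another pod owns this guild
    await handler(event);
  };
}

// Usage with a stand-in predicate; in the gateway this would be isGuildMine.
const handled: string[] = [];
const wrappedHandler = onlyMyGuilds<GuildScoped>(
  (guildId) => guildId === "guild-a",
  (event) => {
    handled.push(String(event.guildId));
  },
);
void wrappedHandler({ guildId: "guild-a" }); // runs the handler
void wrappedHandler({ guildId: "guild-b" }); // filtered out
void wrappedHandler({ guildId: null }); // filtered out
```

In the gateway this would read roughly as `client.on(Events.MessageCreate, onlyMyGuilds(isGuildMine, handleMessage))`, where `handleMessage` is a placeholder for the existing handler.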

Environment Variables

env:
  - name: POD_ORDINAL
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
  - name: NUM_GATEWAY_PODS
    value: "3"  # Update when scaling
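Because a wrong or missing value here silently changes which guilds a pod handles, it may be worth validating both variables at startup. The sketch below is one way to do that. `parseShardConfig` and `parseNonNegativeInt` are hypothetical helpers, and unlike the defaults in guildSharding.ts they fail fast instead of falling back to `0` and `3`.

```typescript
interface ShardConfig {
  podOrdinal: number;
  numGatewayPods: number;
}

// Parses a non-negative decimal integer, rejecting values such as "", "3x" or "-1".
function parseNonNegativeInt(label: string, raw: string | undefined): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    throw new Error(`${label} must be a non-negative integer, got: ${String(raw)}`);
  }
  return Number.parseInt(raw, 10);
}

// Hypothetical helper: reads and cross-checks both variables from an env-like object.
function parseShardConfig(env: Record<string, string | undefined>): ShardConfig {
  const podOrdinal = parseNonNegativeInt("POD_ORDINAL", env["POD_ORDINAL"]);
  const numGatewayPods = parseNonNegativeInt("NUM_GATEWAY_PODS", env["NUM_GATEWAY_PODS"]);
  if (numGatewayPods < 1) {
    throw new Error("NUM_GATEWAY_PODS must be at least 1");
  }
  if (podOrdinal >= numGatewayPods) {
    throw new Error(`POD_ORDINAL ${podOrdinal} is outside 0..${numGatewayPods - 1}`);
  }
  return { podOrdinal, numGatewayPods };
}

// Example with literal values; a real pod would pass process.env.
const exampleConfig = parseShardConfig({ POD_ORDINAL: "2", NUM_GATEWAY_PODS: "3" });
console.log(exampleConfig); // { podOrdinal: 2, numGatewayPods: 3 }
```

One trade-off to be aware of: during a scale-down, a pod whose ordinal is no longer below `NUM_GATEWAY_PODS` would refuse to start under this check, whereas the current defaults would let it run while owning no guilds.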

Scaling Procedure

When changing NUM_GATEWAY_PODS:

  1. Update NUM_GATEWAY_PODS env var in all services
  2. Scale StatefulSet: kubectl scale statefulset gateway --replicas=N
  3. Rolling restart HTTP pods to pick up new env
  4. Guilds automatically redistribute based on new modulo

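Note: Because assignment is hash(guildId) % N, changing N moves most guilds to a different pod, not just a few. Going from 3 to 4 pods, for example, reassigns about 75% of guilds. Their SQLite data needs migration or will be rebuilt.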
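To estimate how much data a given scaling step would disturb, the share of guilds that change pods can be computed directly from the modulo rule. The sketch below uses consecutive integers as a stand-in for uniformly distributed hash values. `movedFraction` is a hypothetical helper, not part of the proposed code.

```typescript
// Fraction of hash values whose shard changes when the pod count goes
// from oldPods to newPods under plain modulo sharding.
function movedFraction(oldPods: number, newPods: number, sampleSize = 120000): number {
  let moved = 0;
  for (let h = 0; h < sampleSize; h++) {
    if (h % oldPods !== h % newPods) {
      moved += 1;
    }
  }
  return moved / sampleSize;
}

console.log(movedFraction(3, 4)); // 0.75
console.log(movedFraction(3, 6)); // 0.5
```

Running this for the planned pod counts before scaling gives a rough idea of how many guilds will need their SQLite data migrated or rebuilt.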

Tasks

  • Implement getShardForGuild() / isGuildMine() functions
  • Add POD_ORDINAL and NUM_GATEWAY_PODS env vars
  • Filter all Discord event handlers by shard
  • Configure Litestream sidecar with per-pod backup paths
  • Update cluster/proposed/gateway-service.yaml (remove CONFIG_SERVICE_URL)
  • Test multi-pod gateway in staging
  • Test scaling up/down procedure
  • Test pod failure recovery
  • Document scaling runbook

Dependencies

Migration Strategy

  1. Deploy new gateway StatefulSet with 3 pods
  2. Guilds auto-distribute based on hash
  3. Each pod builds its SQLite database from Discord events (or restores it from backup)
  4. Switch ingress to new HTTP service
  5. Decommission old StatefulSet

References
