🚀 Benchmark Highlights: By switching from traditional Random routing to Cache-Aware (Hash) routing on a 3-worker cluster, we achieved:
- 12.6x Speedup: Average latency dropped from 75.39ms to just 5.98ms!
- 100% Cache Efficiency: Cache hit rate increased from 83.3% to 100%.
- Zero redundant ML compute across the workers for shared prompts.
In distributed ML inference architectures, identical or highly similar requests (sharing common prefixes) are often naturally load-balanced across multiple worker nodes using random or round-robin routing. As a result, each worker must independently compute and cache identical request segments, leading to redundant work, suboptimal cache hit rates, and higher overall latency.
When requests sharing a common prefix (e.g., standard system prompts, shared conversational context) are routed to the same worker, that worker's LRU cache can serve the repeated computation instantly. Consistent-hash routing ensures that specific prefixes are deterministically mapped to specific workers, maximizing cache utility ("cache locality"), drastically reducing expensive ML compute operations, and lowering average latencies.
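A minimal sketch of such a prefix-hash router. The worker URLs and the `pick_worker` helper are illustrative names, not the actual API of the repo's `router.py`; only the `hash(prompt[:20])` routing key comes from the architecture described here:

```python
import hashlib

# Hypothetical worker endpoints matching the 3-node cluster below.
WORKERS = ["http://localhost:8001", "http://localhost:8002", "http://localhost:8003"]

def pick_worker(prompt: str) -> str:
    """Deterministically map a prompt prefix to a worker.

    Uses the first 20 characters as the routing key, so requests that share
    a system prompt or conversational prefix land on the same node.
    """
    key = prompt[:20].encode("utf-8")
    # md5 is stable across processes, unlike Python's built-in hash() with
    # string hash randomization enabled.
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return WORKERS[digest % len(WORKERS)]

# Requests sharing the same first 20 characters always route identically:
a = pick_worker("You are a helpful assistant. Summarize this text.")
b = pick_worker("You are a helpful assistant. Translate this text.")
assert a == b  # identical prefix -> same worker, warm cache
```

A production router would typically use a consistent-hash ring so that adding or removing a worker remaps only a fraction of the keys, but modulo hashing is enough to demonstrate the cache-locality effect on a fixed-size cluster.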
```
        [Client/Benchmark]
                |
                v
      +------------------+
      |   API Gateway    |  <-- Routes based on hash(prompt[:20])
      |  (Router Node)   |      or randomly.
      +------------------+
        /       |       \
       /        |        \
      v         v         v
 [Worker 1] [Worker 2] [Worker 3]
 (Cache A)  (Cache B)  (Cache C)
  + LLM      + LLM      + LLM
```
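The per-worker cache can be approximated with Python's `functools.lru_cache`. The `generate` function here is a stand-in for the expensive model call (distilgpt2 inference in the actual workers), not the workers' real implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def generate(prompt: str) -> str:
    # Placeholder for the expensive model forward pass.
    return f"completion for: {prompt!r}"

generate("shared system prompt")  # first call: cache miss, computes
generate("shared system prompt")  # second call: served from cache
info = generate.cache_info()
assert info.hits == 1 and info.misses == 1
```

The key point: a worker's cache only helps if the repeated prompt reaches *that* worker, which is exactly what the routing policy controls.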
- Random Routing: A shared prefix has a 1/N chance of hitting the worker that previously generated it. The cache hit rate scales poorly as cluster size N increases.
- Cache-Aware (Hash) Routing: A shared prefix has a 100% chance of hitting the designated worker. Consequently, the cluster-wide cache hit rate approaches the true duplicate-request rate of the incoming traffic.
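A toy simulation (independent of the actual `benchmark.py`) makes this concrete by counting cache hits under each routing mode:

```python
import random

def simulate(mode: str, n_workers: int = 3, n_requests: int = 6,
             n_phrases: int = 2, seed: int = 0) -> float:
    """Return the cache hit rate for one run of interleaved requests."""
    rng = random.Random(seed)
    caches = [set() for _ in range(n_workers)]  # one prompt cache per worker
    hits = 0
    for i in range(n_requests):
        phrase = i % n_phrases  # interleave the distinct phrases
        if mode == "hash":
            worker = phrase % n_workers        # deterministic phrase -> worker
        else:
            worker = rng.randrange(n_workers)  # random routing
        if phrase in caches[worker]:
            hits += 1
        else:
            caches[worker].add(phrase)
    return hits / n_requests

# Hash routing: only the first request per phrase misses -> 4/6 hits.
print(round(simulate("hash"), 3))  # → 0.667
# Random routing: repeats often land on a cold worker, so the average
# rate over many runs is lower (the exact value depends on the seed).
avg = sum(simulate("random", seed=s) for s in range(100)) / 100
print(round(avg, 3))
```

This simulation only models routing and caching, not latency, so its hit rates will not match the benchmark table exactly; it shows the direction and why the gap widens as N grows.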
Results based on sending 6 interleaved requests for 2 distinct phrases:
| Routing Mode | Avg Latency (ms) | Cache Hit Rate | Total Requests |
|---|---|---|---|
| random | 75.39 | 83.3% | 6 |
| hash | 5.98 | 100.0% | 6 |
Use Docker Compose to spin up 3 worker nodes. The workers use a lightweight HuggingFace model (distilgpt2) to simulate realistic text generation workloads.
```
docker-compose up -d
```

Ensure the workers are running on ports 8001, 8002, and 8003.
Install dependencies and run the router locally.
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-router.txt
```

Run the router in random mode:
```
python router.py --mode=random
```

In a separate terminal, with the router running in random mode:
```
python benchmark.py
```

Note the reported metrics.
Stop the router (Ctrl+C), and restart it in hash mode:
```
python router.py --mode=hash
```

Run the benchmark script again:

```
python benchmark.py
```

Observe the increased cache hit rate and the reduced average and P95 latencies!
Tear down the worker nodes:
```
docker-compose down
```