llmcalc

Interactive calculator for LLM serving: estimates compute, memory, and KV cache bandwidth.

Live demo: https://simpx.github.io/llmcalc/

Features

  • Architecture analysis: parses a model config to derive parameter count, KV cache size per token, and per-token FLOPs
  • Deployment planning: GPUs per instance, DP instances per machine
  • Hardware fitting: memory allocation bars for GPU HBM
  • Workload bandwidth estimation: bucket-based analysis for agentic multi-turn workloads
  • Topology visualization: see instances, memory layout, and traffic flow at a glance

Supported Architectures

| Architecture | Example models | Status |
|---|---|---|
| MLA + DSA + MoE | GLM-5, DeepSeek-V3.2 | ✅ Built-in + config.json paste |
| Hybrid GQA + Linear Attn + MoE | Qwen-series hybrids | ✅ Via config.json paste |
| GQA + MoE | DeepSeek-V3, Mixtral | 🚧 Planned |
| Dense MHA/GQA | LLaMA-3, Qwen3 dense | 🚧 Planned |

Quick Start

# Clone and open
git clone https://github.com/simpx/llmcalc.git
cd llmcalc
open index.html   # macOS; or serve it with: python -m http.server 8000, then browse to http://localhost:8000

Or use the hosted version at https://simpx.github.io/llmcalc/

Usage Flow

  1. Select model: Preset (GLM-5) or paste a HuggingFace config.json
  2. Deployment: GPUs per instance and DP replicas per machine
  3. Hardware: GPU type, HBM size, MFU
  4. Workload: Buckets with (T, h) per bucket
  5. Hit rate: Set the local cache hit rate (h_local), which determines how much KV read traffic goes to the network

The traffic overview panel on the right updates in real time as you change parameters.

Formulas

All derived values are computed from the architecture parameters together with the hardware and workload inputs:

avg_pos     = T × (1 + h) / 2
FLOPs/tok   = LinearConst + PosCoef × avg_pos
X           = (Peak × MFU × 10⁶) / (FLOPs/tok)     tokens/s
Write BW    = X × KV_per_token                     GiB/s
Read raw    = Write BW × h / (1 - h)               GiB/s (amortized)
External BW = Read raw × (1 - h_local)             GiB/s (goes to network)
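
The chain of formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual code. The function name `estimate_bucket`, the `Estimate` container, and all numeric inputs are hypothetical. The units are also an assumption: the 10⁶ factor is consistent with Peak given in PFLOPS and FLOPs/tok in GFLOPs, and KV_per_token is taken in bytes and converted to GiB. Reading T as a bucket's context length and h as its cache hit ratio is likewise an interpretation of the formulas.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class Estimate:
    avg_pos: float
    gflops_per_token: float
    tokens_per_s: float
    write_gib_s: float
    read_raw_gib_s: float
    external_gib_s: float


def estimate_bucket(T, h, linear_const, pos_coef, peak_pflops, mfu,
                    kv_bytes_per_token, h_local):
    """Evaluate the README formulas for one workload bucket.

    Assumed units (not stated in the README): linear_const and pos_coef in
    GFLOPs, peak_pflops in PFLOPS, kv_bytes_per_token in bytes.
    """
    if not 0 <= h < 1:
        raise ValueError("h must be in [0, 1) so that 1 - h is non-zero")
    avg_pos = T * (1 + h) / 2
    gflops_per_token = linear_const + pos_coef * avg_pos
    # PFLOPS / GFLOPs = 1e15 / 1e9 = 1e6, which matches the 10^6 factor.
    tokens_per_s = peak_pflops * mfu * 1e6 / gflops_per_token
    write = tokens_per_s * kv_bytes_per_token / GIB
    read_raw = write * h / (1 - h)
    external = read_raw * (1 - h_local)
    return Estimate(avg_pos, gflops_per_token, tokens_per_s,
                    write, read_raw, external)


# Illustrative inputs only; none of these numbers come from a real model.
example = estimate_bucket(T=1000, h=0.5, linear_const=7.0, pos_coef=0.004,
                          peak_pflops=1.0, mfu=0.5,
                          kv_bytes_per_token=65536, h_local=0.75)
print(example)
```

Note that Read raw diverges as h approaches 1, which is why the sketch rejects h = 1 up front.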

For MLA + DSA:

  • LinearConst = projections + FFN + lm_head + MLA bounded attn body
  • PosCoef = 2 × index_n_heads × index_head_dim × num_layers / 10⁹
  • KV/token = (kv_lora_rank × bytes + qk_rope_dim × bytes + index_head_dim × bytes) × num_layers
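
As a rough sketch, the PosCoef and KV/token expressions for MLA + DSA could be evaluated from a config dictionary like this. The helper names and the numeric values are hypothetical and chosen only for illustration. The keys follow the names used in the bullets above; an actual HuggingFace config.json may spell some of them differently (for example `num_hidden_layers` or `qk_rope_head_dim`), so a mapping step would likely be needed. LinearConst is left out because the README does not spell out its terms.

```python
def mla_dsa_pos_coef_gflops(cfg):
    """PosCoef in GFLOPs per position: 2 * heads * head_dim * layers / 1e9."""
    return (2 * cfg["index_n_heads"] * cfg["index_head_dim"]
            * cfg["num_layers"] / 1e9)


def mla_dsa_kv_bytes_per_token(cfg, bytes_per_elem):
    """KV cache bytes per token, summed over all layers."""
    per_layer_elems = (cfg["kv_lora_rank"]
                       + cfg["qk_rope_dim"]
                       + cfg["index_head_dim"])
    return per_layer_elems * bytes_per_elem * cfg["num_layers"]


# Illustrative values only; not read from any specific model's config.json.
cfg = {
    "kv_lora_rank": 512,
    "qk_rope_dim": 64,
    "index_head_dim": 128,
    "index_n_heads": 64,
    "num_layers": 60,
}

print(mla_dsa_pos_coef_gflops(cfg))
print(mla_dsa_kv_bytes_per_token(cfg, bytes_per_elem=2))
```

Because every term in KV/token is multiplied by the same `bytes`, the sketch factors it out; using a different element size per component would require separate arguments.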

Tech Stack

Single-file HTML with CDN-hosted dependencies.

License

MIT
