TokenWeave is a system designed to reduce communication overhead during distributed inference of large language models (LLMs). Even with high-speed interconnects like NVLink, distributed inference can incur up to 20% performance overhead due to communication bottlenecks.
TokenWeave addresses this by introducing a coarse-grained compute-communication overlap mechanism that significantly improves inference efficiency. TokenWeave is currently integrated with Llama-3.3-70B, Qwen2.5-72B, and Mixtral-8x22B, but it can be easily extended to other similar models by modifying the model file. See how we modify llama.py for the steps required to integrate TokenWeave into an existing model file, and see csrc/tokenweave_fused_kernels.cu for the TokenWeave fused kernels that implement the compute-communication overlap.
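To make the core idea concrete, here is a minimal sketch of coarse-grained compute-communication overlap written in plain PyTorch. It is not TokenWeave's actual implementation: the real system relies on the fused kernels in csrc/tokenweave_fused_kernels.cu, and the function name, the mlp callable, and the even token split below are illustrative assumptions.

# Minimal conceptual sketch (not TokenWeave's implementation): split the tokens
# in a batch into two halves so that the all-reduce of one half overlaps with
# the compute of the other half. Assumes torch.distributed is already
# initialized for the tensor-parallel group.
import torch
import torch.distributed as dist

def overlapped_mlp_allreduce(hidden: torch.Tensor, mlp) -> torch.Tensor:
    """hidden: [num_tokens, hidden_size]; mlp: the per-rank (sharded) MLP."""
    first, second = hidden.chunk(2, dim=0)            # split tokens into two halves
    out_first = mlp(first)                            # compute half 1
    work = dist.all_reduce(out_first, async_op=True)  # start reducing half 1 ...
    out_second = mlp(second)                          # ... while computing half 2
    work.wait()                                       # half-1 reduction finished
    dist.all_reduce(out_second)                       # reduce half 2
    return torch.cat([out_first, out_second], dim=0)

The actual integration in llama.py and the fused kernels handle details this sketch omits, such as where to split the tokens, stream management, and fusing the communication with adjacent operations.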
- Compilation: CUDA 12.4
- Runtime environment: Python 3.12, PyTorch 2.6.0, Ubuntu 22.04
- Hardware: 8×H100 DGX system with NVLink interconnects
To ease the setup, we recommend using either of these two Docker images:
pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel
or vllm/vllm-openai:v0.8.5
apt-get update; apt-get upgrade -y; apt-get install kmod git build-essential tmux -y
git clone https://github.com/microsoft/tokenweave.git
cd tokenweave
# Install miniconda; skip if already installed
make install_miniconda # 30 seconds
make create_env
bash # Refresh shell and activate
conda activate tokenweave
make install # 18 minutes
# or alternatively:
pip3 install -v -e .
make install_dependencies # 17 seconds
To get started with TokenWeave:
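# Replace HF_TOKEN below with your Hugging Face access token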
huggingface-cli login --token HF_TOKEN
# Run offline inference examples
make run_qwen2
make run_mixtral
make run_llama3
# NOTE: If Llama 3 gets stuck during the model downloading stage,
# please kill the process and start it again — that should resolve the issue.
# Note: vLLM version 0.8.5.post1 may also hang during model downloading, depending
# on the environment setup.
To Generate TokenWeave Configs (Optional)
If you want to generate TokenWeave configs for a new model, you can use the configs_generator script and modify it as needed. We have already provided configs for Llama-3.3-70B, Qwen2.5-72B, and Mixtral-8x22B on 8×H100.
cd artifact
tmux new -s tokenweave_session # Start a new tmux session
conda activate tokenweave # Activate the conda environment
# Run the following command in the tmux session to generate configs for
# `LLaMA-3.3-70B`, `Qwen2.5-72B`, and `Mixtral-8x22B`
make configs_generator # Takes approximately 1 day
cd .. # Go back to the tokenweave directory
To Profile Using nsys
# install nsys
wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2024_4/NsightSystems-linux-cli-public-2024.4.1.61-3431596.deb
dpkg -i NsightSystems-linux-cli-public-2024.4.1.61-3431596.deb
# run
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node <script_name> <arguments>
Our evaluation includes two types of experiments:
- Microbenchmark performance (Figures 1, 3, 4, 5, 6, 7, and 10)
- End-to-end LLM performance (Figures 11, 12, and 13)
To reproduce the results, use the Makefile in the artifact/ directory:
cd artifact
tmux new -s tokenweave_session # start a new tmux session
conda activate tokenweave # activate the conda environment
# run the following commands in the tmux session
make clean
make correctness_check # check output/ directory for the raw text generated
make all # ~10 hours 48 minutes
# To generate the figures piece-wise
make figure_5_6_7 # 20 minutes
make figure_4_10 # 1 hour 25 minutes
make figure_9 # 8 minutes
make figure_1_3 # 3 hours 25 minutes
make figure_2_11 # 1 hour 10 minutes
make figure_12 # 2 hours 34 minutes
make figure_13 # 1 hour 52 minutes
The artifact scripts redirect the raw output numbers and logs to the output/ folder, while the plotted graphs are stored in the graphs/ folder. CSV files for the figures can be found in the csvs/ directory. Results may show minor runtime variations compared to those reported in the paper, but the general trends should remain consistent.
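If you want to inspect the raw numbers behind a particular figure programmatically, something like the following works; the CSV filename shown is hypothetical, so substitute whatever files appear in csvs/ after your run.

# Quick look at one of the generated CSVs. The filename is hypothetical;
# list the csvs/ directory to see what a run actually produced.
import pandas as pd

df = pd.read_csv("csvs/figure_12.csv")
print(df.head())
print(df.describe())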
If you use our work, please consider citing our paper:
@misc{gond2025tokenweave,
title={TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference},
author={Raja Gond and Nipun Kwatra and Ramachandran Ramjee},
year={2025},
url={https://arxiv.org/abs/2505.11329}
}
This repository originally started as a fork of the vLLM project (commit ID: 87aaade). The Multimem-NVLS collective communication kernels in TokenWeave are built on top of the PyTorch implementation.