QMoE

This repository contains the full code of the paper QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.

It is organized as follows:

  • datautils.py: utilities for dataset loading
  • gptq.py: robust batch-implementation of GPTQ
  • quant.py: quantization utilities
  • sub1.py: efficient inference of compressed models
  • sub1_cuda_kernel.cu: CUDA kernels
  • switch.py: the efficient QMoE compression framework
  • test.py: per-layer benchmarks and ideal compression rates

Dependencies

The project was developed with:

  • torch==2.0.0+cu117
  • transformers==4.28.0
  • datasets==2.10.1
  • CUDA 11.4 GPU drivers
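
For reference, one possible way to install matching versions with pip is shown below; the PyTorch CUDA 11.7 wheel index URL is an assumption about your environment and is not prescribed by this repository.

# install pinned dependencies (wheel index for the +cu117 build is an assumption)
pip install torch==2.0.0+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.28.0 datasets==2.10.1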

CUDA kernels for compressed storage and inference can be installed via:

python setup_cuda.py install
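
Before building, it can help to confirm that the installed torch sees a CUDA-capable GPU; this is a generic sanity check, not a step required by the repository.

# generic check: print the CUDA version torch was built against and whether a GPU is visible
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"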

Usage

Below are sample commands for running the different experiments.

# BF16 baseline eval on C4 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 
# BF16 baseline eval on additional datasets 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round to nearest baseline 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest 

# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000

# test kernels and compute ideal compression rates 
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark

# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128 
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul

In general, you can pass google/switch-large-128 and google/switch-c-2048 to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out-of-the-box due to Hugging Face bugs.

Always specify CUDA_VISIBLE_DEVICES since some commands, like sub1.py, will otherwise attempt to use all available GPUs.

Compressed Models

Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the repository and then simply pass their path to sub1.py.
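
As a rough sketch of that workflow (the repository name below is a placeholder, not one of the actual model links; cloning the large model files may additionally require git-lfs):

# clone a compressed model from Hugging Face (PLACEHOLDER_REPO is hypothetical; substitute the real model repository)
git lfs install
git clone https://huggingface.co/PLACEHOLDER_REPO qmoe-compressed
# evaluate the stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py qmoe-compressed --valsamples 128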

Cite

If you found this work useful, please consider citing:

@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2310.16795}
}
