This repository contains the full code of the paper QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.
It is organized as follows:
- datautils.py: utilities for dataset loading
- gptq.py: robust batch implementation of GPTQ
- quant.py: quantization utilities
- sub1.py: efficient inference of compressed models
- sub1_cuda_kernel.cu: CUDA kernels
- switch.py: the efficient QMoE compression framework
- test.py: per-layer benchmarks and ideal compression rates
The project was developed with:
- torch==2.0.0+cu117
- transformers==4.28.0
- datasets==2.10.1
- CUDA 11.4 GPU drivers
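A quick way to confirm that the pinned versions are picked up (a minimal sanity check, not part of the repository):

import torch, transformers, datasets
# expected output: 2.0.0+cu117, 4.28.0, 2.10.1, and True if a CUDA GPU is visible
print(torch.__version__, transformers.__version__, datasets.__version__)
print(torch.cuda.is_available())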
CUDA kernels for compressed storage and inference can be installed via:
python setup_cuda.py install
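After installation, you can verify that the compiled extension is importable. The module name below is an assumption; check setup_cuda.py for the name the build actually registers:

import torch  # the extension links against torch, so import it first
import sub1_cuda  # hypothetical module name; see setup_cuda.py for the actual one
print("QMoE CUDA kernels loaded")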
Below is a list of sample commands for running different experiments.
# BF16 baseline eval on C4
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128
# BF16 baseline eval on additional datasets
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round-to-nearest baseline
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest
# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000
# test kernels and compute ideal compression rates
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark
# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul
In general, you can pass google/switch-large-128 and google/switch-c-2048 to run on the large-128 and c-2048 models, respectively. Note that SwitchTransformer models other than these three may not work out-of-the-box due to Hugging Face bugs.
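For orientation, the uncompressed BF16 baselines evaluated by switch.py are standard Hugging Face checkpoints; the following is a minimal loading sketch illustrating the underlying transformers API, not how switch.py itself is invoked:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the dense BF16 checkpoint that serves as the uncompressed baseline.
tokenizer = AutoTokenizer.from_pretrained("google/switch-base-128")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/switch-base-128", torch_dtype=torch.bfloat16
)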
Always specify CUDA_VISIBLE_DEVICES, since some commands, like sub1.py, will otherwise attempt to use all available GPUs.
Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the corresponding repository and simply pass its path to sub1.py.
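As an alternative to a manual git clone, the checkpoints can also be fetched with huggingface_hub; the repository id below is only a placeholder, substitute the id from the model pages above:

from huggingface_hub import snapshot_download

# Placeholder repository id; replace it with the actual QMoE model repo you want to run.
path = snapshot_download("NAMESPACE/switch-base-128-qmoe")
print(path)  # pass this directory to sub1.py, e.g. python sub1.py PATH --valsamples 128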
If you found this work useful, please consider citing:
@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2310.16795}
}