My first try at quantizing a model of this size, so please correct me if I'm wrong. I'm trying to quantize Qwen1.5-110B-Chat on RunPod using the exllamav2 main branch (commit 84be945):

```bash
#!/bin/bash
set -e
# cd /workspace
# git clone https://github.com/turboderp/exllamav2.git
# cd exllamav2
# python3 -m venv venv
# source venv/bin/activate
# pip install -r requirements.txt
cd /workspace/exllamav2
source venv/bin/activate
mkdir -p /workspace/exl2/
export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0+PTX"
python convert.py \
-i /workspace/Qwen1.5-110B-Chat/ \
-o /workspace/exl2/ \
-nr \
-om /workspace/Qwen1.5-110B-Chat_measurement.json
echo("Chat_measurement Done")
mkdir -p /workspace/Qwen1.5-110B-Chat-exl2-3.25bpw/
python convert.py \
-i /workspace/Qwen1.5-110B-Chat/ \
-o /workspace/exl2/ \
-nr \
-m /workspace/Qwen1.5-110B-Chat_measurement.json \
-cf /workspace/Qwen1.5-110B-Chat-exl2-3.25bpw/ \
-b 3.25 \
    -hb 6
```

measurement.json was generated successfully (measurement logs omitted).
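A quick way to check that the measurement pass produced a usable file (a minimal sketch, assuming the output is plain JSON as the filename suggests):

```python
import json

# Hedged sanity check on the measurement pass output; the path matches the
# script above, and nothing about the file layout is assumed beyond JSON.
with open("/workspace/Qwen1.5-110B-Chat_measurement.json") as f:
    meas = json.load(f)
print(f"parsed OK, top-level type: {type(meas).__name__}")
```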
Then quantizing the model fails. So I added `export CUDA_LAUNCH_BLOCKING=1` to the environment and got a new error (error logs omitted).
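Note that the flag has to be in the environment before CUDA initializes; the equivalent from inside Python, set before importing torch, would be:

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack
# trace points at the kernel that actually failed instead of a later sync.
# It must be set before the CUDA context is created, i.e. before torch import.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402
```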
Replies: 1 comment 3 replies
Yes, I've been working on this myself. It turns out there are a couple of int overflow bugs while quantizing that had to be addressed (see dev branch).
There's still one I haven't sorted out yet. During quantizing it does a sanity check by multiplying the quantized matrix with an identity matrix using the custom kernels, and for the MLPs in this model one of those identity matrices has a shape of 48k x 48k, which is more than 2^31 elements and guess what? :)
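To make the overflow concrete: a 48k x 48k matrix has more elements than a signed 32-bit integer can index, so any kernel computing flat offsets in int32 wraps into negative territory. A quick illustration (49152 here is just a stand-in for the ~48k MLP dimension):

```python
import ctypes

# Element count of a ~48k x 48k matrix vs. the int32 indexing limit.
n = 49152                                # stand-in for the ~48k MLP width
elements = n * n
print(elements)                          # 2415919104
print(elements > 2**31 - 1)              # True: exceeds INT_MAX
print(ctypes.c_int32(elements).value)    # -1879048192: what an int32 offset sees
```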
Using the dev branch, bypassing that sanity check (e.g. set `diff2 = 0` on line 116 of conversion/quantize.py) and running on a GPU with at least 48 GB of VRAM to accommodate the enormous matrices in this model, you should be able to quantize it. Results are going here. If nothing else you can grab the measurement.json file from there and skip 4 hours of preprocessing.
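If you'd rather script the workaround than edit by hand, something like this applies it (a hedged helper: the `diff2 = 0` override and the line number come from the reply above, but line numbers shift between commits, so check the file first):

```python
from pathlib import Path

# Force `diff2 = 0` on line 116 of conversion/quantize.py (dev branch) to
# bypass the identity-matrix sanity check that overflows int32 at 48k x 48k.
path = Path("conversion/quantize.py")
lines = path.read_text().splitlines(keepends=True)
old = lines[115]                                   # line 116, zero-indexed
indent = old[: len(old) - len(old.lstrip())]       # keep original indentation
lines[115] = indent + "diff2 = 0  # bypass sanity check (int32 overflow)\n"
path.write_text("".join(lines))
```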