My first try at quantizing a model of this size, so please correct me if I'm wrong. I'm trying to quantize Qwen1.5-110B-Chat on RunPod using the exllamav2 main branch (commit 84be945):

```bash
#!/bin/bash
set -e
# cd /workspace
# git clone https://github.com/turboderp/exllamav2.git
# cd exllamav2
# python3 -m venv venv
# source venv/bin/activate
# pip install -r requirements.txt
cd /workspace/exllamav2
source venv/bin/activate
mkdir -p /workspace/exl2/
export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0+PTX"
python convert.py \
-i /workspace/Qwen1.5-110B-Chat/ \
-o /workspace/exl2/ \
-nr \
-om /workspace/Qwen1.5-110B-Chat_measurement.json
echo("Chat_measurement Done")
mkdir -p /workspace/Qwen1.5-110B-Chat-exl2-3.25bpw/
python convert.py \
-i /workspace/Qwen1.5-110B-Chat/ \
-o /workspace/exl2/ \
-nr \
-m /workspace/Qwen1.5-110B-Chat_measurement.json \
-cf /workspace/Qwen1.5-110B-Chat-exl2-3.25bpw/ \
-b 3.25 \
    -hb 6
```

measurement.json was generated successfully (measurement logs omitted).
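A quick way to check that the measurement pass produced a usable file (a minimal sketch, assuming the output is plain JSON as the filename suggests):

```python
import json

# Hedged sanity check on the measurement pass output; the path matches the
# script above, and nothing about the file layout is assumed beyond JSON.
with open("/workspace/Qwen1.5-110B-Chat_measurement.json") as f:
    meas = json.load(f)
print(f"parsed OK, top-level type: {type(meas).__name__}")
```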
Then quantizing the model fails. So I added `export CUDA_LAUNCH_BLOCKING=1` to the environment and got a new error (error logs omitted).
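Note that the flag has to be in the environment before CUDA initializes; the equivalent from inside Python, set before importing torch, would be:

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack
# trace points at the kernel that actually failed instead of a later sync.
# It must be set before the CUDA context is created, i.e. before torch import.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402
```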
Replies: 1 comment 3 replies
Yes, I've been working on this myself. It turns out there are a couple of int overflow bugs while quantizing that had to be addressed (see dev branch).
There's still one I haven't sorted out yet. During quantizing it does a sanity check by multiplying the quantized matrix with an identity matrix using the custom kernels, and for the MLPs in this model one of those identity matrices has a shape of 48k x 48k, which is more than 2^31 elements and guess what? :)
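To make the overflow concrete: a 48k x 48k matrix has more elements than a signed 32-bit integer can index, so any kernel computing flat offsets in int32 wraps into negative territory. A quick illustration (49152 here is just a stand-in for the ~48k MLP dimension):

```python
import ctypes

# Element count of a ~48k x 48k matrix vs. the int32 indexing limit.
n = 49152                                # stand-in for the ~48k MLP width
elements = n * n
print(elements)                          # 2415919104
print(elements > 2**31 - 1)              # True: exceeds INT_MAX
print(ctypes.c_int32(elements).value)    # -1879048192: what an int32 offset sees
```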
Using the dev branch, bypassing that sanity check (e.g. set `diff2 = 0` on line 116 of conversion/quantize.py) and running on a GPU with at least 48 GB of VRAM to accommodate the enormous matrices in this model, you should be able to quantize it. Results are going here. If nothing else you can grab the measurement.json file from there and skip 4 hours of preprocessing.
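If you'd rather script the workaround than edit by hand, something like this applies it (a hedged helper: the `diff2 = 0` override and the line number come from the reply above, but line numbers shift between commits, so check the file first):

```python
from pathlib import Path

# Force `diff2 = 0` on line 116 of conversion/quantize.py (dev branch) to
# bypass the identity-matrix sanity check that overflows int32 at 48k x 48k.
path = Path("conversion/quantize.py")
lines = path.read_text().splitlines(keepends=True)
old = lines[115]                                   # line 116, zero-indexed
indent = old[: len(old) - len(old.lstrip())]       # keep original indentation
lines[115] = indent + "diff2 = 0  # bypass sanity check (int32 overflow)\n"
path.write_text("".join(lines))
```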