
aqlm/inference_kernels/cuda_kernel.py #104

Closed
timo-obrecht opened this issue Jun 17, 2024 · 2 comments

@timo-obrecht

Hello. I am running into a compilation error when trying to use AQLM models: inference (and therefore training) of quantized models fails. This is observed with Llama 2 7B, Llama 3 8B and Mistral-7B.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer
from peft import LoraConfig

device = "cuda"  # added for completeness; the snippet below uses `device`
model_id = "ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf"

quantized_model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map=device,
)

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False,
)

quantized_model.add_adapter(lora_config, adapter_name="laura")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer(["There is a problem somewhere"])
quantized_model(
    input_ids=torch.tensor(inputs["input_ids"]).to(device),
    attention_mask=torch.tensor(inputs["attention_mask"]).to(device),
)
File ~/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8
      5 from torch.utils.cpp_extension import load
      7 CUDA_FOLDER = os.path.dirname(os.path.abspath(__file__))
----> 8 CUDA_KERNEL = load(
      9     name="codebook_cuda",
     10     sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")],
     11 )
     13 torch.library.define(
     14     "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor"
     15 )
     17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)

The compilation of the CUDA extension fails.

RuntimeError: Error building extension 'codebook_cuda': [1/2] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
ninja: build stopped: subcommand failed.

aqlm version: 1.1.6
torch: 2.3.1+cu121
nvcc version: 12.1
gcc version: 10.5.0
GPU: RTX 3070
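
For reference, the failing build can be reproduced in isolation, without loading any model, by calling torch.utils.cpp_extension.load on the same sources that aqlm/inference_kernels/cuda_kernel.py compiles at import time (the path below is just the site-packages location from my traceback; verbose=True makes torch print the full nvcc command lines):

import os
from torch.utils.cpp_extension import load

# Path copied from the traceback above; adjust to your own environment.
CUDA_FOLDER = os.path.expanduser(
    "~/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels"
)

# Same call that cuda_kernel.py makes when it is first imported.
CUDA_KERNEL = load(
    name="codebook_cuda",
    sources=[
        os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"),
        os.path.join(CUDA_FOLDER, "cuda_kernel.cu"),
    ],
    verbose=True,  # print the exact compiler invocations
)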

@BlackSamorez
Collaborator

@timo-obrecht
Author

I had already seen that one; that is why I was already using gcc-10. But apparently you need to tell nvcc which host compiler to use, independently of the system settings. The issue was only fixed by adding this to my .bashrc:

export NVCC_PREPEND_FLAGS='-ccbin /usr/bin/g++-10'

This is straight from the NVIDIA documentation. (I also had to reboot; somehow sourcing the .bashrc was not enough.)
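
For anyone hitting this from a notebook or an environment where editing .bashrc is awkward, I believe the same flag can also be set from Python, as long as it happens before the quantized model's first forward pass (which is when aqlm JIT-compiles the kernel), since the nvcc subprocess inherits the environment:

import os

# Equivalent of the .bashrc export above; the g++-10 path is from my machine,
# adjust as needed. Must run before aqlm compiles cuda_kernel.cu, i.e. before
# the first forward pass of the quantized model.
os.environ["NVCC_PREPEND_FLAGS"] = "-ccbin /usr/bin/g++-10"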
