
aqlm/inference_kernels/cuda_kernel.py #104

Closed
timo-obrecht opened this issue Jun 17, 2024 · 2 comments

@timo-obrecht

Hello. I am running into a compilation error when trying to use AQLM models: inference (and therefore training) of quantized models fails. This is observed with Llama 2 7B, Llama 3 8B and Mistral-7B.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer
from peft import LoraConfig

device = "cuda"  # added for completeness; the snippet below uses `device`
model_id = "ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf"

quantized_model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map=device,
)

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False,
)

quantized_model.add_adapter(lora_config, adapter_name="laura")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer(["There is a problem somewhere"])
quantized_model(
    input_ids=torch.tensor(inputs["input_ids"]).to(device),
    attention_mask=torch.tensor(inputs["attention_mask"]).to(device),
)
File ~/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8
      5 from torch.utils.cpp_extension import load
      7 CUDA_FOLDER = os.path.dirname(os.path.abspath(__file__))
----> 8 CUDA_KERNEL = load(
      9     name="codebook_cuda",
     10     sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")],
     11 )
     13 torch.library.define(
     14     "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor"
     15 )
     17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)

The compilation of the CUDA extension fails.

RuntimeError: Error building extension 'codebook_cuda': [1/2] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /root/work/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
ninja: build stopped: subcommand failed.

aqlm version: 1.1.6
torch: 2.3.1+cu121
nvcc version: 12.1
gcc version: 10.5.0
GPU: RTX 3070
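
For reference, the failing build can be reproduced in isolation, without loading any model, by calling torch.utils.cpp_extension.load on the same sources that aqlm/inference_kernels/cuda_kernel.py compiles at import time (the path below is just the site-packages location from my traceback; verbose=True makes torch print the full nvcc command lines):

import os
from torch.utils.cpp_extension import load

# Path copied from the traceback above; adjust to your own environment.
CUDA_FOLDER = os.path.expanduser(
    "~/work/.venv/lib/python3.10/site-packages/aqlm/inference_kernels"
)

# Same call that cuda_kernel.py makes when it is first imported.
CUDA_KERNEL = load(
    name="codebook_cuda",
    sources=[
        os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"),
        os.path.join(CUDA_FOLDER, "cuda_kernel.cu"),
    ],
    verbose=True,  # print the exact compiler invocations
)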

@BlackSamorez
Collaborator

@timo-obrecht
Author

I had already seen that one; that is why I was already using gcc-10. But apparently you need to tell nvcc which host compiler to use, independently of the system settings. The issue was only fixed by adding this to my .bashrc:

export NVCC_PREPEND_FLAGS='-ccbin /usr/bin/g++-10'

This is straight from the NVIDIA documentation. (I also had to reboot; somehow sourcing the .bashrc was not enough.)
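
For anyone hitting this from a notebook or an environment where editing .bashrc is awkward, I believe the same flag can also be set from Python, as long as it happens before the quantized model's first forward pass (which is when aqlm JIT-compiles the kernel), since the nvcc subprocess inherits the environment:

import os

# Equivalent of the .bashrc export above; the g++-10 path is from my machine,
# adjust as needed. Must run before aqlm compiles cuda_kernel.cu, i.e. before
# the first forward pass of the quantized model.
os.environ["NVCC_PREPEND_FLAGS"] = "-ccbin /usr/bin/g++-10"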
