
Setting CUDA_VISIBLE_DEVICES to anything other than 0 causes an invalid device ordinal error on CUDA #670

Closed
J4Q8 opened this issue Jun 19, 2024 · 2 comments

Comments


J4Q8 commented Jun 19, 2024

Hi!

I am using mistralai/Mistral-7B-v0.1, and when I set CUDA_VISIBLE_DEVICES=1 (or anything other than 0), I get the error below. After some digging, I think the culprit is this line: it uses the number specified in CUDA_VISIBLE_DEVICES as the device ordinal, but once that variable is set, the visible GPUs are renumbered starting from zero. Perhaps just using "cuda:0" there is sufficient. Could you fix this?
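
To illustrate, here is a minimal repro sketch (my own code, not unsloth's), assuming a machine with at least two GPUs:

```python
# Run as:  CUDA_VISIBLE_DEVICES=1 python repro.py
import torch

# Only one GPU is visible to the process, and it is renumbered as ordinal 0.
print(torch.cuda.device_count())  # -> 1

ok = torch.full((4, 1), -100, device="cuda:0")   # works: physical GPU 1, visible as cuda:0
bad = torch.full((4, 1), -100, device="cuda:1")  # RuntimeError: CUDA error: invalid device ordinal
```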

Thanks in advance!

P.S. I have only tried this with the Mistral model, but it may well affect your other supported models too.

Error trace:

Traceback (most recent call last):
  File "/home/jlucki/s2024-safety-interp/experiments/robustness_evaluation/finetune.py", line 98, in <module>
    model = FastLanguageModel.get_peft_model(
  File "/local/home/jlucki/mambaforge/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py", line 1709, in get_peft_model
    model = FastLlamaModel.patch_peft_model(model, use_gradient_checkpointing)
  File "/local/home/jlucki/mambaforge/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py", line 1908, in patch_peft_model
    extra_ignored_labels = torch.full((max_seq_length, 1), -100, device = device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
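
For reference, here is a hedged sketch of the change I have in mind (the variable names are my assumptions, not the actual code in llama.py): build the device string from the renumbered ordinal of the current device rather than from the raw CUDA_VISIBLE_DEVICES value.

```python
import os
import torch

max_seq_length = 2048  # example value

# Assumed shape of the bug: the raw CUDA_VISIBLE_DEVICES value is used as the
# device ordinal, which is invalid once the visible GPUs are renumbered.
# visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
# device = f"cuda:{visible}"  # "cuda:1" -> invalid device ordinal

# Suggested fix: address the (renumbered) current device directly.
device = f"cuda:{torch.cuda.current_device()}"  # "cuda:0" when one GPU is visible
extra_ignored_labels = torch.full((max_seq_length, 1), -100, device=device)
```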
@danielhanchen (Contributor) commented

Oh interesting hmm

@danielhanchen (Contributor) commented

Oh fixed! This actually resolved some issues in #660! Thanks!
