runtime error: mat1 and mat2 shapes cannot be multiplied #8

Open

ijustloveses opened this issue Mar 15, 2023 · 17 comments

Comments
@ijustloveses

File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1628, in train
return inner_training_loop(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2637, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2669, in compute_loss
outputs = model(**inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 852, in forward
outputs = self.model.decoder(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 616, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 612, in custom_forward
return module(*inputs, output_attentions, None)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 167, in forward
value_states = self.v_proj(hidden_states).view(bsz, tgt_len, self.num_heads, self.head_dim)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x7 and 8x4096)

@ijustloveses
Author

A6000; the same error happens whether training on one GPU or two.
Is it caused by bitsandbytes?

@kooshi
Contributor

kooshi commented Mar 15, 2023

Yeah, seems to be: bitsandbytes-foundation/bitsandbytes#162

@gururise
Contributor

gururise commented Mar 17, 2023

Strange: on my friend's machine with 2x 3090 (Ubuntu 20.04, CUDA 11.2), he gets this error even when training on a single 3090. However, on a cloud machine I'm using with similar specs but only 1x 3090 (Ubuntu 20.04, CUDA 12.1), I do not get this error.

Could it have something to do with having two 3090s installed, even though only one is used for training? Or could it be the CUDA version? The working machine with a single 3090 is running CUDA 12.

@saimarpaka

Ran into the same issue with 4x V100. Any tips appreciated.

@gururise
Contributor

Forcing training to run on only a single GPU fixed it for us:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

@AngainorDev
Contributor

Had the same issue when training on 2 GPUs using just python finetune.py.

Got it running on both GPUs using torchrun:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py

Make sure your settings are consistent (gradient accumulation, micro batch size). For instance, with 2x 3090 I can run it with batch size 128 and micro batch size 64, so gradient accumulation is one.
Using values that don't divide evenly here can lead to longer training or a smaller effective batch size than planned (see the sketch below).
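
A minimal sketch of how these numbers relate, assuming the usual alpaca-lora-style derivation (the variable names are illustrative, not necessarily the ones used in finetune.py):

import os

batch_size = 128        # effective examples per optimizer step
micro_batch_size = 64   # examples per GPU per forward/backward pass
world_size = int(os.environ.get("WORLD_SIZE", 1))  # 2 when launched with torchrun on 2 GPUs

# Each accumulation step processes micro_batch_size * world_size examples,
# so the accumulation count follows from the effective batch size.
gradient_accumulation_steps = batch_size // (micro_batch_size * world_size)
print(gradient_accumulation_steps)  # 1 for batch 128, micro batch 64, 2 GPUs

# If batch_size is not divisible by micro_batch_size * world_size, the integer
# division silently shrinks the effective batch below what was planned.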

@vsevolodl

I had the same issue with a 2x A6000 setup. Forcing training on a single GPU fixed it.

@Ludobico

(quoting @AngainorDev's torchrun workaround above)

The torchrun command above fixed the issue for me, thanks!

@kooshi
Contributor

kooshi commented Mar 23, 2023

For everyone dealing with this: it happens because bitsandbytes doesn't play nicely with Trainer when Trainer falls back to DataParallel.

We're not missing out on much, since DataParallel is quite slow. As referenced above, finetune.py supports DistributedDataParallel, which you can launch with torchrun and which is much faster (see the sketch below).

I also just submitted a PR to run this with model parallelism, so you can use multiple GPUs to run larger models.

#131
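
A minimal sketch of the distinction, assuming only the standard torch and torchrun environment variables (illustrative, not code from this repo): with plain python finetune.py on a multi-GPU machine, Trainer runs a single process with more than one visible GPU and wraps the model in torch.nn.DataParallel, which is what trips the 8-bit matmul; under torchrun, each process is one DDP rank driving a single GPU.

import os
import torch

# Rough check of which parallelism mode a launch will end up in.
if int(os.environ.get("WORLD_SIZE", 1)) > 1:
    mode = "DistributedDataParallel (launched via torchrun)"
elif torch.cuda.device_count() > 1:
    mode = "DataParallel fallback -- expect the mat1/mat2 shape error with 8-bit layers"
else:
    mode = "single GPU"
print(mode)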

@Qubitium

@kooshi Would this PR allow pipeline parallelism for inference on llama as well? Would it be possible to have a parallel sample for generate.py?

@kooshi
Contributor

kooshi commented Mar 23, 2023

@kooshi Would this PR allow pipeline parallelism for inference on llama as well? Would it be possible to have a parallel sample for generate.py?

That's a good question. I haven't tried it, but I think it should at least run. If you have more than two GPUs, take the whole block from under if PIPE_CHUNKS > 0 and add it to the inference code.

I'm not sure if pipelining even helps with inference, though.

@ddingwang12

(also quoting @AngainorDev's torchrun workaround above)

This fixed the issue for me as well, thanks!

@sfxworks

sfxworks commented Mar 29, 2023

What if I have five GPUs?

Setting WORLD_SIZE=5, --nproc_per_node=5, and CUDA_VISIBLE_DEVICES=0,1,2,3,4 gets me a segfault:

100%|██████████| 1/1 [00:00<00:00, 291.86it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Map:   6%|▌         | 2994/49715 [00:09<02:25, 320.01 examples/s]Loading cached split indices for dataset at /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-30ac7dbe29aff00a.arrow and /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5cbcf99c07ccbefe.arrow
Map:  39%|███▉      | 19328/49715 [01:13<02:01, 250.71 examples/s]Found cached dataset json (/root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 218.68it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-30ac7dbe29aff00a.arrow and /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5cbcf99c07ccbefe.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 6) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
==================================================
finetune.py FAILED
--------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 7)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 7
[2]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 10)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 10
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 6)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 6

Edit: actually I get this with just one GPU, too.

@yangYJT

yangYJT commented May 24, 2023

@sfxworks Yes, me too. Have you solved it?

@brando90

I get a similar issue with Falcon, but not in the official Colab:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.7/lib64')}
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.7/lib64: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Updated by package ocaml')}
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
mode='disabled'
run=
report_to='none'
{'report_to': 'none', 'path2config': '/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/wandb_uu/sweep_configs/debug_config.yaml', 'program': '~/ultimate-utils/ultimate-utils-proj-src/uutils/wandb_uu/sweeps_common.py', 'project': 'playground', 'entity': 'brando', 'name': 'debug-logging-to-wandb-plataform-test', 'description': 'debug-not-logging-to-wandb-plataform-test', 'metric': {'name': 'train_loss', 'goal': 'minimize'}, 'method': 'random', 'optimizer': 'nadam', 'scheduler': 'cosine', 'lr': 0.0001, 'batch_size': 32, 'num_its': 2, 'run_cap': 1}
Found cached dataset json (/lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.61s/it]
Loading cached processed dataset at /lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-889fee109929377a.arrow
  0%|                                                                                                                           | 0/500 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/pdb.py", line 1723, in main
    pdb._runscript(mainpyfile)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/pdb.py", line 1583, in _runscript
    self.run(statement)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/bdb.py", line 598, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py", line 34, in <module>
    main_falcon()
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py", line 21, in main_falcon
    train(args)
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/train/sft/qlora_ft.py", line 58, in train_falcon
    trainer.train()
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 753, in forward
    transformer_outputs = self.transformer(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 648, in forward
    outputs = block(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 385, in forward
    attn_outputs = self.self_attention(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 242, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x4544 and 1x10614784)
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/tuners/lora.py(565)forward()
-> result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

@NewEricWang

I had the same issue with 2xA6000 setup. Forcing training on a single GPU fixed it.

How do you force training on a single GPU when running python finetune.py?

@NewEricWang

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In my case, it doesn't work. I added the above line to finetune.py, but when running python finetune.py, all GPUs are still used.
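
A likely reason (an assumption, not confirmed in this thread): CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so setting it inside finetune.py after torch has already touched the GPUs has no effect. A minimal sketch of a placement that should work:

# At the very top of finetune.py, before importing torch or transformers,
# so the CUDA runtime only ever sees GPU 0.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expected: 1

Alternatively, set it on the command line (CUDA_VISIBLE_DEVICES=0 python finetune.py), which avoids import-order issues entirely.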
