Run time error #51351

Open
brightbsit opened this issue Jan 29, 2021 · 11 comments

brightbsit commented Jan 29, 2021

I'm using WSL2 and Docker with an RTX 3080.
I got a runtime error. When I used a GTX 1050 Ti with the CUDA 10.0 version of the Docker image, this didn't happen.
Please help me.

Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 863, in main
    global_step, avg_loss = train(args, train_dataset, val_dataset, model, tokenizer)
  File "oscar/run_captioning.py", line 434, in train
    outputs = model(**inputs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 440, in forward
    return self.encode_forward(*args, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 448, in encode_forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 271, in forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 109, in forward
    history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 140, in forward
    head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 82, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 36, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @csarofeen @ptrblck @xwang233 @ngimel @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @mszhanyi @skyline75489

@ezyang added the high priority, module: cuda, module: windows, and triaged labels on Jan 29, 2021
@ezyang added the module: dependency bug label on Jan 29, 2021

ezyang commented Jan 29, 2021

Sounds like a cublas problem...

ngimel commented Jan 29, 2021

Please provide the script to reproduce the error.

@ngimel added the module: cublas label on Jan 29, 2021

brightbsit commented

I tried the latest version of torch (1.7.0) with CUDA, and torch 1.2.0 with CUDA 10.0, in Docker (Python 3.7).

It started happening after I changed my GPU from the 1050 Ti to the 3080.

The full log is here:

(py37) apple@DESKTOP-4PPS67A:/mnt/d/oscar$ python oscar/run_captioning.py --model_name_or_path pre_trained/base-vg-labels/ep_67_588997 --do_train --do_lower_case --evaluate_during_training --add_od_labels --learning_rate 0.00003 --per_gpu_train_batch_size 64 --num_train_epochs 30 --save_steps 5000 --output_dir output/
2021-01-30 17:40:13,564 vlpretrain WARNING: Device: cuda, n_gpu: 1
2021-01-30 17:50:08,424 vlpretrain INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, add_od_labels=True, config_name='', data_dir='', device=device(type='cuda'), do_eval=False, do_lower_case=True, do_test=False, do_train=True, drop_out=0.1, eval_model_dir='', evaluate_during_training=True, gradient_accumulation_steps=1, img_feature_dim=2054, img_feature_type='frcnn', learning_rate=3e-05, length_penalty=1, logging_steps=20, loss_type='sfmx', mask_prob=0.15, max_gen_length=20, max_grad_norm=1.0, max_img_seq_length=50, max_masked_tokens=3, max_seq_a_length=40, max_seq_length=70, max_steps=-1, min_constraints_to_satisfy=2, model_name_or_path='pre_trained/base-vg-labels/ep_67_588997', n_gpu=1, no_cuda=False, num_beams=5, num_keep_best=1, num_labels=2, num_return_sequences=1, num_train_epochs=30, num_workers=4, output_dir='output/', output_hidden_states=False, output_mode='classification', per_gpu_eval_batch_size=64, per_gpu_train_batch_size=64, repetition_penalty=1, save_steps=5000, scheduler='linear', scst=False, seed=88, temperature=1, test_yaml='oscar/coco_caption/test.yaml', tokenizer_name='', top_k=0, top_p=1, train_yaml='oscar/coco_caption/train.yaml', use_cbs=False, val_yaml='oscar/coco_caption/val.yaml', warmup_steps=0, weight_decay=0.05)
/mnt/d/oscar/oscar/utils/misc.py:33: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  return yaml.load(fp)
2021-01-30 17:50:29,238 vlpretrain INFO: ***** Running training *****
INFO:vlpretrain:***** Running training *****
2021-01-30 17:50:29,248 vlpretrain INFO:   Num examples = 566747
INFO:vlpretrain:  Num examples = 566747
2021-01-30 17:50:29,256 vlpretrain INFO:   Num Epochs = 30
INFO:vlpretrain:  Num Epochs = 30
2021-01-30 17:50:29,264 vlpretrain INFO:   Batch size per GPU = 64
INFO:vlpretrain:  Batch size per GPU = 64
2021-01-30 17:50:29,273 vlpretrain INFO:   Total train batch size (w. parallel, & accumulation) = 64
INFO:vlpretrain:  Total train batch size (w. parallel, & accumulation) = 64
2021-01-30 17:50:29,283 vlpretrain INFO:   Gradient Accumulation steps = 1
INFO:vlpretrain:  Gradient Accumulation steps = 1
2021-01-30 17:50:29,292 vlpretrain INFO:   Total optimization steps = 265680
INFO:vlpretrain:  Total optimization steps = 265680
Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 863, in main
    global_step, avg_loss = train(args, train_dataset, val_dataset, model, tokenizer)
  File "oscar/run_captioning.py", line 434, in train
    outputs = model(**inputs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 440, in forward
    return self.encode_forward(*args, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 448, in encode_forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 271, in forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 109, in forward
    history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 140, in forward
    head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 82, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 36, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

peterjc123 commented

As far as I know, the RTX 3080 requires CUDA >= 11.0. Please update the CUDA installation and possibly the GPU driver.
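
For reference, a quick way to check what the installed wheel supports (a sketch, not from the original thread; the calls are standard torch APIs):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version the wheel was built against
python -c "import torch; print(torch.cuda.get_device_capability(0))"     # RTX 3080 reports (8, 6); pre-11.0 builds ship no kernels for it
nvidia-smi                                                                # driver version; CUDA 11 needs a 450-series or newer driver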

brightbsit commented

peterjc123 commented

@brightbsit Sorry, I'm not so familiar with the PyTorch Docker containers. Does it have PyTorch installed? Which CUDA version is it built for? To make sure, could you please install PyTorch through the command pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html?
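
Once installed, a minimal smoke test (a sketch, assuming the cu110 wheels above; not from the original thread) exercises the same cublasSgemm path that failed:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"               # expect: 11.0 True
python -c "import torch; x = torch.randn(8, 8, device='cuda'); print((x @ x).sum().item())"  # a GPU matmul like the one in the traceback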

brightbsit commented

@peterjc123 I tried it and another error happened:

2021-01-30 18:55:53,979 vlpretrain WARNING: Device: cuda, n_gpu: 1
Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 847, in main
    from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
  File "/mnt/d/oscar/transformers/pytorch_transformers/modeling_utils.py", line 450, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 427, in __init__
    self.bert = BertImgModel(config)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 153, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/mnt/d/oscar/transformers/pytorch_transformers/modeling_bert.py", line 253, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 583, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1043, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

peterjc123 commented

@brightbsit Looks like you'll need to rebuild apex.
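
For completeness, a rebuild sketch following the apex README at the time (https://github.com/NVIDIA/apex); the clone location is arbitrary, and this assumes the new torch/CUDA 11.0 install is active:

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./   # recompiles fused_layer_norm_cuda against the current torch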

ngimel commented Feb 1, 2021

I'm removing high priority; it looks like a build issue.

@Blackhex added the module: wsl label and removed the module: windows label on Jan 11, 2023

cristianPanaite commented

Removed module: windows and added module: wsl because this issue seems to be encountered on a WSL system.

iremyux commented Aug 22, 2023

Since the Windows label was removed, I am also removing this issue from the 'PyTorch on Windows' project.
