Run time error #51351

Open
brightbsit opened this issue Jan 29, 2021 · 11 comments

brightbsit commented Jan 29, 2021

I'm using WSL2 and Docker with an RTX 3080.
I got a runtime error. When I used a GTX 1050 Ti with the CUDA 10.0 version of the Docker image, this didn't happen.
Please help me.

Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 863, in main
    global_step, avg_loss = train(args, train_dataset, val_dataset, model, tokenizer)
  File "oscar/run_captioning.py", line 434, in train
    outputs = model(**inputs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 440, in forward
    return self.encode_forward(*args, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 448, in encode_forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 271, in forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 109, in forward
    history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 140, in forward
    head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 82, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 36, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @csarofeen @ptrblck @xwang233 @ngimel @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @mszhanyi @skyline75489

@ezyang added the high priority, module: cuda, module: windows, and triaged labels on Jan 29, 2021
@ezyang added the module: dependency bug label on Jan 29, 2021

ezyang commented Jan 29, 2021

Sounds like a cublas problem...

ngimel commented Jan 29, 2021

Please provide the script to reproduce the error.

@ngimel added the module: cublas label on Jan 29, 2021

brightbsit commented

I tried the latest version of torch (1.7.0) with CUDA, and torch 1.2.0 with CUDA 10.0, in Docker (Python 3.7).

It started happening after I changed my GPU from the 1050 Ti to the 3080.

The full log is here:

(py37) apple@DESKTOP-4PPS67A:/mnt/d/oscar$ python oscar/run_captioning.py --model_name_or_path pre_trained/base-vg-labels/ep_67_588997 --do_train --do_lower_case --evaluate_during_training --add_od_labels --learning_rate 0.00003 --per_gpu_train_batch_size 64 --num_train_epochs 30 --save_steps 5000 --output_dir output/
2021-01-30 17:40:13,564 vlpretrain WARNING: Device: cuda, n_gpu: 1
2021-01-30 17:50:08,424 vlpretrain INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, add_od_labels=True, config_name='', data_dir='', device=device(type='cuda'), do_eval=False, do_lower_case=True, do_test=False, do_train=True, drop_out=0.1, eval_model_dir='', evaluate_during_training=True, gradient_accumulation_steps=1, img_feature_dim=2054, img_feature_type='frcnn', learning_rate=3e-05, length_penalty=1, logging_steps=20, loss_type='sfmx', mask_prob=0.15, max_gen_length=20, max_grad_norm=1.0, max_img_seq_length=50, max_masked_tokens=3, max_seq_a_length=40, max_seq_length=70, max_steps=-1, min_constraints_to_satisfy=2, model_name_or_path='pre_trained/base-vg-labels/ep_67_588997', n_gpu=1, no_cuda=False, num_beams=5, num_keep_best=1, num_labels=2, num_return_sequences=1, num_train_epochs=30, num_workers=4, output_dir='output/', output_hidden_states=False, output_mode='classification', per_gpu_eval_batch_size=64, per_gpu_train_batch_size=64, repetition_penalty=1, save_steps=5000, scheduler='linear', scst=False, seed=88, temperature=1, test_yaml='oscar/coco_caption/test.yaml', tokenizer_name='', top_k=0, top_p=1, train_yaml='oscar/coco_caption/train.yaml', use_cbs=False, val_yaml='oscar/coco_caption/val.yaml', warmup_steps=0, weight_decay=0.05)
/mnt/d/oscar/oscar/utils/misc.py:33: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  return yaml.load(fp)
2021-01-30 17:50:29,238 vlpretrain INFO: ***** Running training *****
INFO:vlpretrain:***** Running training *****
2021-01-30 17:50:29,248 vlpretrain INFO:   Num examples = 566747
INFO:vlpretrain:  Num examples = 566747
2021-01-30 17:50:29,256 vlpretrain INFO:   Num Epochs = 30
INFO:vlpretrain:  Num Epochs = 30
2021-01-30 17:50:29,264 vlpretrain INFO:   Batch size per GPU = 64
INFO:vlpretrain:  Batch size per GPU = 64
2021-01-30 17:50:29,273 vlpretrain INFO:   Total train batch size (w. parallel, & accumulation) = 64
INFO:vlpretrain:  Total train batch size (w. parallel, & accumulation) = 64
2021-01-30 17:50:29,283 vlpretrain INFO:   Gradient Accumulation steps = 1
INFO:vlpretrain:  Gradient Accumulation steps = 1
2021-01-30 17:50:29,292 vlpretrain INFO:   Total optimization steps = 265680
INFO:vlpretrain:  Total optimization steps = 265680
Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 863, in main
    global_step, avg_loss = train(args, train_dataset, val_dataset, model, tokenizer)
  File "oscar/run_captioning.py", line 434, in train
    outputs = model(**inputs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 440, in forward
    return self.encode_forward(*args, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 448, in encode_forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 271, in forward
    encoder_history_states=encoder_history_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 109, in forward
    history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 140, in forward
    head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 82, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask, history_state)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 36, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

peterjc123 commented

As far as I know, the RTX 3080 requires CUDA >= 11.0. Please update the CUDA installation and possibly the GPU driver.
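
For reference, a quick way to check what the installed wheel supports (a sketch, not from the original thread; the calls are standard torch APIs):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version the wheel was built against
python -c "import torch; print(torch.cuda.get_device_capability(0))"     # RTX 3080 reports (8, 6); pre-11.0 builds ship no kernels for it
nvidia-smi                                                                # driver version; CUDA 11 needs a 450-series or newer driver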

brightbsit commented

peterjc123 commented

@brightbsit Sorry, I'm not so familiar with the PyTorch Docker containers. Does it have PyTorch installed? Which CUDA version is it built for? To make sure, could you please install PyTorch through the command pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html?
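
Once installed, a minimal smoke test (a sketch, assuming the cu110 wheels above; not from the original thread) exercises the same cublasSgemm path that failed:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"               # expect: 11.0 True
python -c "import torch; x = torch.randn(8, 8, device='cuda'); print((x @ x).sum().item())"  # a GPU matmul like the one in the traceback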

brightbsit commented

@peterjc123 I tried it and another error happened:

2021-01-30 18:55:53,979 vlpretrain WARNING: Device: cuda, n_gpu: 1
Traceback (most recent call last):
  File "oscar/run_captioning.py", line 884, in <module>
    main()
  File "oscar/run_captioning.py", line 847, in main
    from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
  File "/mnt/d/oscar/transformers/pytorch_transformers/modeling_utils.py", line 450, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 427, in __init__
    self.bert = BertImgModel(config)
  File "/mnt/d/oscar/oscar/modeling/modeling_bert.py", line 153, in __init__
    self.embeddings = BertEmbeddings(config)
  File "/mnt/d/oscar/transformers/pytorch_transformers/modeling_bert.py", line 253, in __init__
    self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/home/apple/anaconda3/envs/py37/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 583, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1043, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/apple/anaconda3/envs/py37/lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

peterjc123 commented

@brightbsit Looks like you'll need to rebuild apex.
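
For completeness, a rebuild sketch following the apex README at the time (https://github.com/NVIDIA/apex); the clone location is arbitrary, and this assumes the new torch/CUDA 11.0 install is active:

pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./   # recompiles fused_layer_norm_cuda against the current torch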

ngimel commented Feb 1, 2021

I'm removing high priority; it looks like a build issue.

@Blackhex added the module: wsl label and removed the module: windows label on Jan 11, 2023

cristianPanaite commented

Removed module: windows and added module: wsl because this issue seems to be encountered on a WSL system.

iremyux commented Aug 22, 2023

Since the Windows label was removed, I am also removing this issue from the 'PyTorch on Windows' project.
