Issue when fine-tuning the model from huggingface hub #26

Closed
ioana-blue opened this issue Feb 3, 2021 · 7 comments

ioana-blue commented Feb 3, 2021

Thanks for making the model available on the huggingface hub. I tried to use it with some existing code I have; I've been running the same code with 10+ models from the huggingface hub with no issues. When I try to run with "vinai/bertweet-base", I get the following error (note that the model loads fine and training appears to run for several iterations); see below.

I'm not sure what the problem could be. Could the version of transformers and/or PyTorch be the issue? Do you know which versions you tried it with? I'm using transformers 3.4 and torch 1.5.1+cu101.

Thanks for your help!

| 44/1923 [00:11<08:11,  3.82it/s]
Traceback (most recent call last):
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
    main()
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
    trainer.train(
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
    tr_loss += self.training_step(model, inputs)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/transformers/trainer.py", line 1070, in training_step
    loss.backward()
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x2b9e5852d536 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xd43696 (0x2b9e2155e696 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&) + 0x19e1 (0x2b9e2251ce11 in /u/ioana/.conda/envs/tr34/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
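
A CUDA device-side assert is reported asynchronously, so the traceback above points at loss.backward() rather than the operation that actually failed. A minimal sketch of two common ways to surface the real error, assuming the same training script (only the CUDA_LAUNCH_BLOCKING variable is standard PyTorch behaviour; the rest is illustrative):

    import os

    # Must be set before the first CUDA call (e.g. at the very top of the script):
    # it makes kernel launches synchronous, so the assert is raised at the
    # offending op instead of at a later, unrelated call such as loss.backward().
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # Alternatively, run a single batch on CPU, where the same bug surfaces as a
    # regular Python exception with a full stack trace, e.g.:
    # model.to("cpu"); outputs = model(**{k: v.to("cpu") for k, v in batch.items()})
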
ioana-blue (Author) commented:

It's not a GPU problem. I tried running on the CPU; it also crashes with the following:

***** Running training *****
  Num examples = 15383
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1923
  0%|          | 0/1923 [00:00<?, ?it/s]
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error

ioana-blue (Author) commented:

I upgraded to the latest PyTorch (1.7.1); same issue.

datquocnguyen (Member) commented Feb 4, 2021

Can you please try a newer transformers version?

datquocnguyen (Member) commented:

I have no idea what happened.
You might also try deleting BERTweet from your transformers cache folder in ~/.cache/torch, so that it will automatically be re-downloaded properly.
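
A hedged alternative to deleting the cache by hand: from_pretrained accepts a force_download flag that re-fetches the files from the hub, which achieves the same fresh download. A minimal sketch (the checkpoint name is the one from this thread; the classification head is only an assumption about the fine-tuning setup):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Re-download the BERTweet files, bypassing any possibly corrupted copies
    # in the local transformers cache.
    tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", force_download=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        "vinai/bertweet-base", force_download=True
    )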

ioana-blue (Author) commented:

Sure, I can try that as well. Meanwhile, I ran in interactive mode on a GPU and managed to get a more informative error (I haven't looked into why this happens yet):

Traceback (most recent call last):
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 284, in <module>
    main()
  File "../models/jigsaw/tr-3.4//run_puppets.py", line 195, in main
    trainer.train(
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/trainer.py", line 756, in train
    tr_loss += self.training_step(model, inputs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/trainer.py", line 1056, in training_step
    loss = self.compute_loss(model, inputs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/trainer.py", line 1080, in compute_loss
    outputs = model(**inputs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 990, in forward
    outputs = self.roberta(
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 674, in forward
    embedding_output = self.embeddings(
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dccstor/redrug_ier/envs/attack/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 121, in forward
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [616,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
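
That assert usually means an embedding lookup received an index that is out of range for its embedding table. A minimal sketch reproducing the same failure, with sizes chosen to match BERTweet's 130 positions (illustrative only, not the model code):

    import torch
    import torch.nn as nn

    # An embedding table with 130 rows, like BERTweet's position embeddings.
    position_embeddings = nn.Embedding(130, 768)

    # Position ids 0..511, as produced by a 512-token sequence.
    position_ids = torch.arange(512)

    # On CPU: "IndexError: index out of range in self".
    # On CUDA: "Assertion `srcIndex < srcSelectDimSize` failed" device-side assert.
    position_embeddings(position_ids)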

datquocnguyen (Member) commented:

I am not sure the error comes from BERTweet itself: "indexSelectLargeIndex: block: [616,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed."

ioana-blue (Author) commented:

I figured out what the problem is. I was fine-tuning with a max_seq_length of 512, while the BERTweet model was trained with a maximum sequence length of 130. Once I used a sequence length of less than 130, it worked. I filed a feature request for transformers to assert that the sequence size is no larger than max_position_embeddings; see huggingface/transformers#10015.
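
A minimal sketch of that guard, assuming a standard tokenizer setup (only the checkpoint name and the 512/130 values come from this thread; everything else is illustrative):

    from transformers import AutoConfig, AutoTokenizer

    checkpoint = "vinai/bertweet-base"
    config = AutoConfig.from_pretrained(checkpoint)      # max_position_embeddings is 130 for BERTweet
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    requested_max_seq_length = 512                       # the value that triggered the crash
    # RoBERTa-style models reserve two position slots, so stay a couple below the limit.
    max_seq_length = min(requested_max_seq_length, config.max_position_embeddings - 2)

    encoded = tokenizer(
        "just a sample tweet used to illustrate truncation",
        truncation=True,
        max_length=max_seq_length,
    )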
