
Error in run_vits_finetuning when starting to train #22

Closed
khof312 opened this issue Mar 22, 2024 · 2 comments
khof312 commented Mar 22, 2024

I am encountering an error when trying to run accelerate launch run_vits_finetuning.py. I get past the Weights & Biases authentication, but training fails as soon as it starts:

03/22/2024 19:57:22 - INFO - __main__ - ***** Running training *****
03/22/2024 19:57:22 - INFO - __main__ -   Num examples = 110
03/22/2024 19:57:22 - INFO - __main__ -   Num Epochs = 200
03/22/2024 19:57:22 - INFO - __main__ -   Instantaneous batch size per device = 16
03/22/2024 19:57:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
03/22/2024 19:57:22 - INFO - __main__ -   Gradient Accumulation steps = 1
03/22/2024 19:57:22 - INFO - __main__ -   Total optimization steps = 1400
Steps:   0%|                                                                                | 0/1400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
TypeError: BatchEncoding.to() got an unexpected keyword argument 'non_blocking'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
    main()
  File "/content/finetune-hf-vits/run_vits_finetuning.py", line 1090, in main
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
AttributeError: 'NoneType' object has no attribute 'to'
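
For what it's worth, the two stacked exceptions can be reproduced in isolation if the collated batch contains a None value, which is what the last frame suggests. A minimal sketch, assuming a transformers version whose BatchEncoding.to() only accepts a device argument (as in the traceback above); the batch contents here are made up for illustration:

import torch
from transformers import BatchEncoding

# Hypothetical batch with a None entry, mimicking what the collator seems to yield here.
batch = BatchEncoding({"input_ids": torch.tensor([[1, 2, 3]]), "labels": None})

# First exception: accelerate's send_to_device passes non_blocking, which this
# BatchEncoding.to() signature does not accept.
try:
    batch.to("cpu", non_blocking=True)
except TypeError as e:
    print(e)

# Fallback path: accelerate retries without non_blocking, and the None value
# then triggers AttributeError: 'NoneType' object has no attribute 'to'.
try:
    batch.to("cpu")
except AttributeError as e:
    print(e)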

I am relatively new to all of this, so any ideas on where the problem might be would be really helpful. It looks to me like the training data is not loading successfully, but I'm trying to figure out whether there is an issue in my config. For context:

  • I am trying to run everything locally, so I don't push to the hub.
  • I don't need the wandb functionality, but the problem persists whether or not I visualize with it.
  • The problem appears both on Colab and locally under WSL, with the same error.

I have put a reproducible example in this Colab notebook, which also includes the configs I'm using. To get started I was just trying to reproduce the Gujarati training example. Any pointers to where I'm going wrong would be greatly appreciated!


oza75 commented Mar 30, 2024

I was able to fix this by using the exact same versions of transformers, datasets, and accelerate as mentioned in the requirements.txt file.

pip uninstall transformers datasets accelerate  # remove the versions that were installed when you ran pip install -r requirements.txt

pip install transformers==4.35.1 datasets[audio]==2.14.7 accelerate==0.24.1
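
To double-check that the pinned versions are the ones actually being imported (just a quick sanity check, not part of the original fix):

import transformers, datasets, accelerate
# Expected after the reinstall above: 4.35.1 2.14.7 0.24.1
print(transformers.__version__, datasets.__version__, accelerate.__version__)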


khof312 commented Mar 31, 2024

Thank you SO much, that worked for me as well! I will close this issue, but perhaps, if you don't mind @ylacombe, I will open a pull request to change requirements.txt to hard-code the versions. I can verify that at least the transformers==4.38.2 datasets==2.18.0 accelerate==0.28.0 and transformers==4.37.2 datasets==2.18.0 accelerate==0.28.0 combinations were not working for me.
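
For reference, pinning the working combination in requirements.txt would look roughly like the lines below (only the three packages discussed in this thread; any other entries in the repository's requirements.txt would stay as they are):

transformers==4.35.1
datasets[audio]==2.14.7
accelerate==0.24.1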
