Yolov5 v3.1 Resume Training #11385

Closed · 1 of 2 tasks
Aagamshah9 opened this issue Apr 18, 2023 · 6 comments
Labels: bug (Something isn't working), Stale

Comments


Aagamshah9 commented Apr 18, 2023

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

I am currently training an Object365 YOLOv5m model using the environment settings and requirements specified by the v3.1 tag of the YOLOv5 repo. I am using this specific version because my entire ecosystem and deployment on several edge devices only support v3.1, so upgrading is not an option for me. The issue is that whenever training is interrupted and I try to resume it, I get the following error and have never been able to resume training, which is now a critical problem. I have also tried deleting the backup dir, renaming it, and commenting out the line responsible for creating the backup dir, but none of this helps.
Kindly please look into this, and let me know if you need any additional information from my end; I would be happy to share it.
Screenshot from 2023-04-18 10-56-30

Environment

  • YOLOv5 v3.1 commit 702c4fa
  • OS Ubuntu 20.04
  • Python 3.8
  • Other dependencies:

absl-py==1.4.0
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==3.0.1
coloredlogs==15.0.1
contourpy==1.0.7
cycler==0.11.0
Cython==0.29.33
flatbuffers==23.1.21
fonttools==4.38.0
future==0.18.3
google-auth==2.16.0
google-auth-oauthlib==0.4.6
grpcio==1.51.1
humanfriendly==10.0
idna==3.4
importlib-metadata==6.0.0
kiwisolver==1.4.4
Markdown==3.4.1
MarkupSafe==2.1.2
matplotlib==3.2.2
mpmath==1.2.1
numpy==1.19.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
onnxruntime==1.6.0
onnxruntime-gpu==1.13.1
opencv-python==4.1.2.30
packaging==23.0
Pillow==7.2.0
protobuf==3.20.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.9
python-dateutil==2.8.2
PyYAML==6.0
requests==2.28.2
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.10.0
six==1.16.0
sympy==1.11.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
torch==1.6.0
torchvision==0.7.0
tqdm==4.64.1
typing-extensions==4.4.0
unzip==1.0.0
urllib3==1.26.14
Werkzeug==2.2.2
zipp==3.12.1

Minimal Reproducible Example

# Resume

start_epoch, best_fitness = 0, 0.0
if pretrained:
    # Optimizer
    if ckpt['optimizer'] is not None:
        optimizer.load_state_dict(ckpt['optimizer'])
        best_fitness = ckpt['best_fitness']

    # Results
    if ckpt.get('training_results') is not None:
        with open(results_file, 'w') as file:
            file.write(ckpt['training_results'])  # write results.txt

    # Epochs
    start_epoch = ckpt['epoch'] + 1
    if opt.resume:
        assert start_epoch > 0, '%s training to %g epochs is finished, nothing to resume.' % (weights, epochs)
        shutil.copytree(wdir, wdir.parent / f'weights_backup_epoch{start_epoch - 1}') # save previous weights
    if epochs < start_epoch:
        logger.info('%s has been trained for %g epochs. Fine-tuning for %g additional epochs.' %
                    (weights, ckpt['epoch'], epochs))
        epochs += ckpt['epoch']  # finetune additional epochs

    del ckpt, state_dict

Additional

After a little further debugging with a few print statements, I found that the issue lies in the following chunk of code; during resume, execution never gets past it:

# DDP mode

if cuda and rank != -1:
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)
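
A minimal diagnostic sketch of the per-rank logging described above (illustrative only; cuda, rank, and opt.local_rank follow the names in train.py, while the wrap_ddp helper is assumed): synchronizing on a barrier right before the DDP constructor makes a rank that has already crashed show up as a hang at the barrier, rather than as a Socket Timeout much later inside DDP.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(model, cuda, rank, local_rank):
    # Sketch, not part of train.py: confirm every process reaches this point,
    # then synchronize before wrapping the model in DDP.
    if cuda and rank != -1:
        print(f'rank {rank}: reached DDP init', flush=True)
        dist.barrier()  # hangs here if another rank already exited
        model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    return model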

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
Aagamshah9 added the bug label Apr 18, 2023
glenn-jocher (Member) commented

@Aagamshah9 thanks for reaching out! It sounds like you are having trouble resuming training of an Object365 YOLOv5m model with the v3.1 tag of the YOLOv5 repo. The error message appears to be prompting you to delete your old checkpoints, which you mentioned you've tried, but it isn't entirely clear from the information provided what other steps you've taken. Could you share more detail on how you are trying to resume training, along with any other relevant logs or error messages you are seeing? Also, have you tried training your model with the latest version of YOLOv5 to see whether the issue persists? Please let me know so I can better assist you!

Aagamshah9 (Author) commented Apr 18, 2023

Hi @glenn-jocher, thank you so much for your prompt response. I was training the Object365 model for 300 epochs when I received the following error:
unable to determine the device handle for gpu 0000:68:00.0: unknown error
which forced me to restart the system and therefore resume training.
The command I use to resume training is:

python3 -m torch.distributed.launch --nproc_per_node 2 train.py --resume ./runs/exp8_Object365/weights/last.pt

But as I mentioned, whenever I resume training I get the error shown above, which is [Error 17]: File exists... So I tried deleting the backup folder, but every run re-creates it, so I modified the script as follows:

# Epochs

    start_epoch = ckpt['epoch'] + 1
    if opt.resume:
        assert start_epoch > 0, '%s training to %g epochs is finished, nothing to resume.' % (weights, epochs)
        backup_dir = wdir.parent / f'weights_backup_epoch{start_epoch - 1}'
        if backup_dir.exists():
            print(f"Backup directory {backup_dir} already exists. Skipping backup...")
        else:
            shutil.copytree(wdir, backup_dir)
        # shutil.copytree(wdir, wdir.parent / f'weights_backup_epoch{start_epoch - 1}') # save previous weights
    if epochs < start_epoch:
        logger.info('%s has been trained for %g epochs. Fine-tuning for %g additional epochs.' %
                    (weights, ckpt['epoch'], epochs))
        epochs += ckpt['epoch']  # finetune additional epochs

    del ckpt, state_dict

That took care of the [Error 17] issue, but I then added a few print statements to check where the code stops working, and found that it never goes beyond the following chunk:

# DDP mode

if cuda and rank != -1:
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)

and gives me the following error:

Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 169, in train
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Socket Timeout

Also, I have tried training the model using the latest version of YOLOv5 and it works like a charm, but as I mentioned, my entire ecosystem and deployment pipeline for edge devices is based on the v3.1 tag, because only a model exported to ONNX with torch==1.6.0, torchvision==0.7.0, and onnxruntime==1.6.0 can run on our edge devices. Our current edge deployment cannot execute the SiLU activation function, which is used in versions newer than v3.1.

glenn-jocher (Member) commented

@Aagamshah9 thank you for providing additional context around your issue. Based on the error message you shared, there appears to be a communication problem between the processes in your distributed training setup. Socket timeouts during torch DDP training usually indicate network instability or high latency, which can be a sign of some kind of bottleneck. There are a number of possible causes, such as issues with your network configuration or limitations of the hardware you are using. To diagnose this further, I would recommend checking the logs for warnings or errors related to network connectivity, checking system resources (CPU, GPU, memory, disk) for problems, and making sure your network is configured correctly for distributed training. You could also try increasing the distributed timeout, i.e. the timeout argument used when the process group is initialized, to see if that helps. Once you isolate the issue, you can troubleshoot it further or seek appropriate help.
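
For reference, a minimal sketch of raising the distributed timeout (an assumption about where it would be applied, not the exact YOLOv5 v3.1 code): in PyTorch the timeout is passed to torch.distributed.init_process_group rather than to the DistributedDataParallel constructor, and for the NCCL backend it is only honored when blocking wait is enabled.

import os
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, local_rank):
    # Sketch only. NCCL_BLOCKING_WAIT must be set before the process group is
    # created, otherwise the NCCL backend may ignore the timeout.
    os.environ.setdefault('NCCL_BLOCKING_WAIT', '1')
    # Default timeout is 30 minutes; a longer value gives a slow or busy rank
    # more time to reach the initial parameter broadcast before the group aborts.
    dist.init_process_group(backend='nccl', init_method='env://',
                            timeout=timedelta(minutes=60))
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)

Under torch.distributed.launch, local_rank would come from the --local_rank argument that each spawned process receives.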

Aagamshah9 (Author) commented

@glenn-jocher Thank you so much for pointing me in the right direction on this issue. I will look into the approach you mentioned, try it out, and see whether it resolves the problem. Once I am able to isolate the issue, I should be able to debug it. Thank you once again for your prompt response and help; I really appreciate it.

glenn-jocher (Member) commented

You're welcome @Aagamshah9! Feel free to reach out if you have any other questions or need further assistance. Good luck with your training!

github-actions bot commented

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label May 20, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 30, 2023