Fix HUB session with DDP training #13103

Merged
merged 77 commits into from
Jun 23, 2024

Conversation

Laughing-q
Member

@Laughing-q Laughing-q commented May 24, 2024

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved distributed training and HUB session handling in the Ultralytics training workflow. 🛠️

📊 Key Changes

  • Added Distributed Training Synchronization: Introduced the torch_distributed_zero_first decorator to ensure all distributed processes wait for the main process to complete certain tasks.
  • Enhanced Model Loading: Incorporated the usage of torch_distributed_zero_first to prevent multiple auto-downloads of the dataset in distributed settings.
  • Initialized HUB Sessions: Added HUB session initialization in the trainer.
  • Adjusted HUB Authentication: Refined HUB model URL checking to ensure correct handling of non-HUB model URLs.
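The "rank zero first" synchronization described above can be sketched without PyTorch. The snippet below is only an illustration of the pattern, not the code from this PR: `zero_first` and `run_ranks` are hypothetical names, threads stand in for DDP processes, and `threading.Barrier` stands in for a distributed barrier. The summary calls the helper a decorator; this sketch assumes the common context-manager form.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first(rank: int, barrier: threading.Barrier):
    """Run the with-body on rank 0 first; other ranks wait until it is done."""
    if rank != 0:
        barrier.wait()  # non-zero ranks park here until rank 0 reaches the barrier
    yield
    if rank == 0:
        barrier.wait()  # rank 0 arrives only after its body finished, releasing the rest


def run_ranks(world_size: int) -> list:
    """Simulate `world_size` ranks with threads and record who ran the body, in order."""
    barrier = threading.Barrier(world_size)
    order, lock = [], threading.Lock()

    def worker(rank: int) -> None:
        with zero_first(rank, barrier):
            with lock:
                order.append(rank)  # stand-in for a download or a one-time setup step

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in reversed(threads):  # start rank 0 last to show the ordering still holds
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(run_ranks(4))  # rank 0 is always first; the order of the others may vary
```

Because rank 0 reaches the barrier only after finishing its body, work such as a download happens once before the other ranks proceed, and they can then reuse the result.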

🎯 Purpose & Impact

  • Efficiency in Distributed Training: Prevents redundant downloads, saving time and resources in distributed environments. 🌐
  • Stable Integrations: Ensures HUB sessions are properly authenticated and initialized, enhancing the reliability of model training logs. 🔒
  • Developer Experience: Simplifies the process for developers working in distributed training scenarios, making it more user-friendly and robust. 🧑‍💻


codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 35.53%. Comparing base (6872028) to head (9719119).
Report is 1 commits behind head on main.

❗ Current head 9719119 differs from pull request most recent head 1cfcced

Please upload reports for the commit 1cfcced to get more accurate results.

| Files | Patch % | Lines |
|------|------|------|
| ultralytics/engine/trainer.py | 0.00% | 3 Missing ⚠️ |
| ultralytics/hub/session.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (6872028) and HEAD (9719119).

HEAD has 1 upload less than BASE

| Flag | BASE (6872028) | HEAD (9719119) |
|------|------|------|
| Benchmarks | 2 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #13103       +/-   ##
===========================================
- Coverage   70.40%   35.53%   -34.88%     
===========================================
  Files         124      124               
  Lines       15905    15886       -19     
===========================================
- Hits        11198     5645     -5553     
- Misses       4707    10241     +5534     
| Flag | Coverage Δ |
|------|------|
| Benchmarks | 35.53% <33.33%> (-0.26%) ⬇️ |
| GPU | ? |
| Tests | ? |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@Laughing-q
Member Author

Laughing-q commented May 24, 2024

@glenn-jocher I'm not really familiar with the whole workflow between HUB and ultralytics, but I figured we can load a model directly from HUB, so I kept some of the hub-session code in model.py to avoid breaking anything.

if self.is_hub_model(model):
    # Fetch model from HUB
    checks.check_requirements("hub-sdk>=0.0.6")
    self.session = self._get_hub_session(model)
    model = self.session.model_file

https://github.com/ultralytics/ultralytics/blob/654c37f09bc3b1e9d182a6f4ea315616bf14c643/ultralytics/engine/model.py#L180-185
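For readers unfamiliar with the check guarding the snippet above, here is a minimal sketch of what "is this a HUB model?" could look like. It is an assumption for illustration, not the actual `is_hub_model` implementation: `looks_like_hub_model` is a hypothetical name, and the URL prefix is simply taken from the HUB model links that appear in this thread.

```python
# Assumption: HUB model pages live under this prefix, as in the URLs posted in this thread.
HUB_WEB_ROOT = "https://hub.ultralytics.com"


def looks_like_hub_model(model) -> bool:
    """Return True only for strings that point at a HUB model page (hypothetical helper)."""
    return isinstance(model, str) and model.startswith(f"{HUB_WEB_ROOT}/models/")


if __name__ == "__main__":
    print(looks_like_hub_model("https://hub.ultralytics.com/models/oljNmlCqCllzTUL5Jwwj"))
    print(looks_like_hub_model("yolov8s-pose.pt"))  # local weights are not a HUB URL
```

A strict prefix match like this keeps local weight files and unrelated URLs from being routed into the HUB session path.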

@Laughing-q
Member Author

@glenn-jocher I also tested on my local multi-GPU machine and it seems to work properly, i.e. it tries to create a hub-session in DDP training. That's as far as I'm able to test, since I don't have a HUB environment or account (I'm guessing it's the same as wandb, which needs an account).
Waiting for @Burhan-Q to have some more tests. :)

@glenn-jocher
Member

Got it, thanks @Laughing-q! @Burhan-Q can you test this PR for DDP training from HUB and then also from Ultralytics to HUB?

@Burhan-Q
Member

@glenn-jocher @Laughing-q @sergiuwaxmann this worked with a model created from HUB

Here's the post-training printout of the training arguments.

model.session.train_args
>>> {
    'batch': -1, 
    'cache': 'ram', 
    'data': 'coco128.yaml', 
    'device': [0, 1],  # DDP enabled
    'epochs': 10, 
    'imgsz': 640, 
    'patience': 50, 
    'time': None
}
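The `# DDP enabled` comment above follows from `device` naming more than one GPU. A rough sketch of that mapping, under stated assumptions: `world_size_from_device` is a hypothetical helper, not the trainer's real logic, and it ignores whether CUDA is actually available.

```python
def world_size_from_device(device) -> int:
    """Count the GPUs named by a device argument (hypothetical, simplified helper)."""
    if isinstance(device, (list, tuple)):
        return len(device)  # e.g. [0, 1] -> 2
    if isinstance(device, str):
        ids = [d.strip() for d in device.split(",") if d.strip()]
        # Only purely numeric ids count as GPUs; names such as "cpu" give 0
        return len(ids) if all(d.isdigit() for d in ids) else 0
    if isinstance(device, int):
        return 1  # a single GPU index, e.g. device=6
    return 0


def uses_ddp(device) -> bool:
    """Assumption: multi-process training is only needed for more than one GPU."""
    return world_size_from_device(device) > 1


if __name__ == "__main__":
    print(world_size_from_device([0, 1]), uses_ddp([0, 1]))
    print(world_size_from_device(6), uses_ddp(6))
```

With `device=[0, 1]` this gives a world size of 2, which matches the `--nproc_per_node 2` in the DDP debug command in the log below.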
HUB DDP training report

[image: screenshot of the HUB DDP training report, not preserved]

Local log

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO('https://hub.ultralytics.com/models/oljNmlCqCllzTUL5Jwwj')
results = model.train()

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=detect, mode=train, model=yolov8s.pt, data=coco128.yaml, epochs=10, time=None, patience=50, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=ultralytics/runs/detect/train

                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    591360  ultralytics.nn.modules.block.C2f             [768, 256, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 16                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1   1969152  ultralytics.nn.modules.block.C2f             [768, 512, 1]                 
 22        [15, 18, 21]  1   2147008  ultralytics.nn.modules.head.Detect           [80, [128, 256, 512]]         
Model summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs

Transferred 355/355 items from pretrained weights

WARNING ⚠️ 'batch=-1' for AutoBatch is incompatible with Multi-GPU training, setting default 'batch=16'

DDP: debug command /home/burhan/ultra_repo/.ultra/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 54685 /home/burhan/.config/Ultralytics/DDP/_temp_o135mjif139927424095984.py

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)

Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀
Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
train: Caching images (0.1GB RAM): 100%|██████████| 128/128 [00:00<00:00, 1948.18it/s]
val: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Caching images (0.1GB RAM): 100%|██████████| 128/128 [00:00<00:00, 984.03it/s]
Plotting labels to ultralytics/runs/detect/train/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)

Image sizes 640 train, 640 val
Using 16 dataloader workers

Logging results to ultralytics/runs/detect/train

Starting training for 10 epochs...
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/10      2.37G      1.215      1.451      1.245         42        640: 100%|██████████| 8/8 [00:02<00:00,  3.45it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:01<00:00,  6.85it/s]
                   all        128        929      0.757      0.684       0.76      0.588

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       2/10      2.43G       1.21      1.482      1.245         48        640: 100%|██████████| 8/8 [00:00<00:00,  8.90it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.82it/s]
                   all        128        929      0.748      0.665      0.764       0.58

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       3/10      2.44G      1.122      1.112      1.147         48        640: 100%|██████████| 8/8 [00:00<00:00,  9.47it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.54it/s]
                   all        128        929      0.711      0.692      0.777      0.591

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       4/10      2.44G     0.9762     0.9867      1.098         62        640: 100%|██████████| 8/8 [00:00<00:00,  9.60it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.94it/s]
                   all        128        929      0.771       0.71      0.797      0.623

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       5/10      2.49G     0.9425     0.9654      1.071         68        640: 100%|██████████| 8/8 [00:00<00:00,  9.25it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 16.00it/s]
                   all        128        929      0.762      0.734      0.802      0.621

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       6/10      2.44G      1.026      0.899      1.084         31        640: 100%|██████████| 8/8 [00:00<00:00,  9.77it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.68it/s]
                   all        128        929      0.826      0.733      0.809      0.629

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       7/10      2.48G     0.8806     0.8252      1.058         45        640: 100%|██████████| 8/8 [00:00<00:00,  9.26it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.98it/s]
                   all        128        929      0.799      0.768      0.818      0.638

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       8/10      2.47G     0.8754      0.818      1.025         49        640: 100%|██████████| 8/8 [00:00<00:00,  9.14it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.87it/s]
                   all        128        929      0.844      0.722      0.827      0.649

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       9/10      2.49G     0.9863     0.8815      1.152         36        640: 100%|██████████| 8/8 [00:00<00:00,  9.62it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.99it/s]
                   all        128        929      0.883       0.72      0.836      0.663

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      10/10      2.45G       0.98     0.8182      1.072         39        640: 100%|██████████| 8/8 [00:00<00:00, 10.12it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.65it/s]
                   all        128        929      0.867      0.736      0.841      0.666

10 epochs completed in 0.006 hours.
Optimizer stripped from ultralytics/runs/detect/train/weights/last.pt, 22.6MB
Optimizer stripped from ultralytics/runs/detect/train/weights/best.pt, 22.6MB

Validating ultralytics/runs/detect/train/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
Model summary (fused): 168 layers, 11156544 parameters, 0 gradients, 28.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:02<00:00,  3.27it/s]
                   all        128        929      0.869      0.738      0.841      0.666
                person        128        254      0.964      0.632      0.853      0.652
               bicycle        128          6      0.801      0.333      0.537      0.312
                   car        128         46          1      0.306      0.591      0.311
            motorcycle        128          5      0.917          1      0.995       0.89
              airplane        128          6      0.964          1      0.995      0.907
                   bus        128          7          1      0.791      0.995      0.883
                 train        128          3      0.962          1      0.995      0.808
                 truck        128         12      0.961        0.5       0.69      0.434
                  boat        128          6      0.878      0.667      0.791      0.522
         traffic light        128         14          1      0.249      0.356      0.285
             stop sign        128          2      0.898          1      0.995      0.848
                 bench        128          9          1      0.733      0.833      0.603
                  bird        128         16          1      0.881      0.995      0.707
                   cat        128          4      0.926          1      0.995      0.891
                   dog        128          9      0.891      0.908      0.984      0.819
                 horse        128          2      0.896          1      0.995       0.75
              elephant        128         17          1      0.923      0.955      0.795
                  bear        128          1      0.695          1      0.995      0.895
                 zebra        128          4      0.936          1      0.995      0.959
               giraffe        128          9          1      0.956      0.995       0.85
              backpack        128          6          1      0.703      0.837      0.594
              umbrella        128         18      0.821      0.889       0.95      0.704
               handbag        128         19      0.821      0.421      0.598       0.44
                   tie        128          7          1      0.802      0.858      0.618
              suitcase        128          4          1      0.813      0.995      0.616
               frisbee        128          5       0.89        0.8      0.804      0.684
                  skis        128          1      0.781          1      0.995      0.796
             snowboard        128          7      0.569      0.714      0.763      0.579
           sports ball        128          6          1      0.554       0.67      0.442
                  kite        128         10       0.89        0.3      0.583      0.296
          baseball bat        128          4      0.668       0.25      0.565       0.43
        baseball glove        128          7      0.935      0.429      0.439        0.3
            skateboard        128          5      0.623          1      0.938       0.59
         tennis racket        128          7      0.745      0.571      0.607      0.401
                bottle        128         18          1      0.353      0.752      0.499
            wine glass        128         16          1      0.485      0.703      0.501
                   cup        128         36      0.849      0.806      0.857       0.58
                  fork        128          6      0.802      0.333      0.644      0.478
                 knife        128         16      0.856      0.625       0.79      0.584
                 spoon        128         22        0.9      0.411      0.636      0.484
                  bowl        128         28        0.9      0.821      0.872      0.707
                banana        128          1      0.732          1      0.995      0.995
              sandwich        128          2      0.863          1      0.995      0.995
                orange        128          4      0.556       0.75      0.702      0.524
              broccoli        128         11      0.616      0.295      0.558      0.382
                carrot        128         24      0.838      0.647      0.874      0.647
               hot dog        128          2      0.876          1      0.995      0.995
                 pizza        128          5      0.967          1      0.995      0.904
                 donut        128         14      0.695          1      0.936      0.862
                  cake        128          4      0.944          1      0.995      0.905
                 chair        128         35      0.765      0.559      0.742      0.545
                 couch        128          6      0.816      0.749      0.852      0.708
          potted plant        128         14      0.873      0.929      0.955      0.781
                   bed        128          3      0.908          1      0.995       0.94
          dining table        128         13          1      0.757      0.854      0.738
                toilet        128          2      0.965          1      0.995      0.896
                    tv        128          2      0.898          1      0.995      0.799
                laptop        128          3      0.932          1      0.995      0.907
                 mouse        128          2      0.588        0.5      0.545      0.413
                remote        128          8          1      0.638      0.861      0.662
            cell phone        128          8          1      0.542       0.63      0.442
             microwave        128          3       0.74          1      0.995      0.952
                  oven        128          5       0.76        0.6      0.665      0.479
                  sink        128          6      0.901        0.5      0.783      0.658
          refrigerator        128          5      0.935          1      0.995        0.8
                  book        128         29      0.708      0.241      0.625      0.418
                 clock        128          9       0.89        0.9      0.973      0.818
                  vase        128          2      0.631          1      0.995      0.995
              scissors        128          1          1          0      0.995      0.219
            teddy bear        128         21      0.838       0.81      0.862      0.625
            toothbrush        128          5      0.734          1      0.995      0.836

Speed: 0.1ms preprocess, 3.3ms inference, 0.0ms loss, 3.1ms postprocess per image
Results saved to ultralytics/runs/detect/train

Ultralytics HUB: Syncing final model...
100%|██████████| 21.5M/21.5M [00:01<00:00, 12.3MB/s]
Ultralytics HUB: Done ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀

@Burhan-Q added the labels "bug" (Something isn't working) and "HUB" (Ultralytics HUB issues) on May 24, 2024
@Burhan-Q
Member

Note

This PR is related to ultralytics/hub#695 and ultralytics/hub#606

@Burhan-Q
Member

Burhan-Q commented May 24, 2024

I attempted launching a local training while logged into my HUB account (both with and without DDP), but the HUB logging doesn't appear to work for either case on this branch.

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)

No model was uploaded after training completed.

@Laughing-q
Member Author

Laughing-q commented May 24, 2024

I attempted launching a local training while logged into my HUB account (both with and without DDP), but the HUB logging doesn't appear to work for either case on this branch.

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)

No model was uploaded after training completes

Does this work properly in single-GPU mode on the main branch?

@Burhan-Q
Member

@Laughing-q yes it does work when I switch to main

@Laughing-q
Member Author

@Burhan-Q that's strange... for me, training reaches the step that creates the hub instance.
[screenshot: pic-240524-2118-56]
Again, that's as far as I'm able to debug.
@Burhan-Q perhaps you could add some print statements here in the function to check whether self.hub_session is created successfully?
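One possible shape for that debug printing, as a sketch only: `describe_hub_session` is a hypothetical helper, and the attribute names it reads (`client.authenticated`, `model`) are the ones used in the trainer snippet elsewhere in this thread. It uses `getattr` so it does not fail if the session object differs.

```python
def describe_hub_session(session) -> str:
    """Summarize whether a HUB session object exists and looks usable (hypothetical)."""
    if session is None:
        return "hub_session is None (not created)"
    client = getattr(session, "client", None)
    authenticated = getattr(client, "authenticated", None)  # None means unknown
    has_model = getattr(session, "model", None) is not None
    return (
        f"hub_session={type(session).__name__}, "
        f"authenticated={authenticated}, has_model={has_model}"
    )


if __name__ == "__main__":
    # Inside the trainer one would pass self.hub_session instead of None
    print(describe_hub_session(None))
```

Printing this right after the session setup on each rank would show whether the session is missing, unauthenticated, or lacks a model.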

@Laughing-q
Member Author

Oh, I was supposed to post this one:
[screenshot: NmrBul4E4J]

@Burhan-Q
Member

I tested a modification to the trainer._setup_hub() method that gave me some interesting results:

        if SETTINGS["hub"] and self.hub_session is None:
            # Create a model in HUB
            try:
                from ultralytics.hub.session import HUBTrainingSession

                session = HUBTrainingSession(self.args.model)
                self.hub_session = session if session.client.authenticated else self.hub_session
                if self.hub_session:
                    self.hub_session.create_model(self.args)
                    # Check model was created
                    if not self.hub_session.model:
                        self.hub_session = None
            except (PermissionError, ModuleNotFoundError):
                # Ignore PermissionError and ModuleNotFoundError which indicates hub-sdk not installed
                pass
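The decision logic in the modification above can be exercised offline with stand-in objects. This is a sketch under assumptions: `resolve_hub_session` and `fake_session` are hypothetical names, and the fake session never contacts HUB; it only mimics the `client.authenticated`, `create_model`, and `model` members used in the snippet.

```python
from types import SimpleNamespace


def resolve_hub_session(session, train_args):
    """Return the session only if it is authenticated and a HUB model was created."""
    if not session.client.authenticated:
        return None  # mirrors keeping hub_session as None when not authenticated
    session.create_model(train_args)
    return session if session.model else None  # mirrors the 'Check model was created' guard


def fake_session(authenticated: bool, server_responds: bool):
    """Build a stand-in session object; it does not talk to HUB."""
    s = SimpleNamespace(client=SimpleNamespace(authenticated=authenticated), model=None)
    # create_model sets .model only when the pretend server answers
    s.create_model = lambda args: setattr(s, "model", {"args": args} if server_responds else None)
    return s


if __name__ == "__main__":
    print(resolve_hub_session(fake_session(True, False), {"epochs": 10}))  # None
    print(resolve_hub_session(fake_session(True, True), {"epochs": 10}) is not None)  # True
```

The middle case, authenticated but with no model created, is the one that matters here: without the extra guard, the trainer would keep a session that has nothing to upload to.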

With these changes, I suddenly get lots of these in the training log:

hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Train log

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-seg.pt")
result = model.train(data="coco8-seg.yaml", epochs=10, device=3)

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=segment, mode=train, model=yolov8s-seg.pt, data=coco8-seg.yaml, epochs=10, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=3, workers=8, project=None, name=train6, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/home/burhan/tests/ultralytics/runs/segment/train6

                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    591360  ultralytics.nn.modules.block.C2f             [768, 256, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 16                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1   1969152  ultralytics.nn.modules.block.C2f             [768, 512, 1]                 
 22        [15, 18, 21]  1   2801504  ultralytics.nn.modules.head.Segment          [80, 32, 128, [128, 256, 512]]
YOLOv8s-seg summary: 261 layers, 11821056 parameters, 11821040 gradients, 42.9 GFLOPs

Transferred 417/417 items from pretrained weights
2024-05-24 09:13:41,106 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 09:13:41,109 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model.
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco8-seg/labels/train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
val: Scanning /home/shared/datasets/coco8-seg/labels/val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
Plotting labels to /home/burhan/tests/ultralytics/runs/segment/train6/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 66 weight(decay=0.0), 77 weight(decay=0.0005), 76 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/burhan/tests/ultralytics/runs/segment/train6
Starting training for 10 epochs...
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       1/10      1.52G     0.9476      2.702      1.978      1.283         13        640: 100%|██████████| 1/1 [00:00<00:00,  1.31it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  4.19it/s]
                   all          4         17      0.822      0.898       0.94      0.679      0.822      0.898      0.939      0.592

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       2/10      1.54G     0.9262      2.695      2.596      1.247         13        640: 100%|██████████| 1/1 [00:00<00:00,  9.71it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 14.64it/s]
                   all          4         17      0.832      0.905      0.941       0.68      0.832      0.905      0.941        0.6

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       3/10      1.57G     0.8526      2.626      2.046      1.234         13        640: 100%|██████████| 1/1 [00:00<00:00,  8.52it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 17.86it/s]
                   all          4         17      0.838      0.913      0.942       0.68      0.838      0.913      0.935      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       4/10      1.57G      1.143      3.088       2.44      1.395         13        640: 100%|██████████| 1/1 [00:00<00:00,  8.65it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 23.26it/s]
                   all          4         17      0.834      0.913       0.94      0.672      0.834      0.913      0.939      0.601

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       5/10      1.55G      1.009      2.831       2.98      1.271         13        640: 100%|██████████| 1/1 [00:00<00:00,  3.74it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95):   0%|          | 0/1 [00:00<?, ?it/s]
2024-05-24 09:13:50,859 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 17.41it/s]
                   all          4         17      0.836      0.912      0.941      0.683      0.836      0.912      0.932      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       6/10      1.59G      1.268      3.083       2.17      1.654         13        640: 100%|██████████| 1/1 [00:00<00:00,  5.34it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.36it/s]
                   all          4         17      0.844      0.912      0.942      0.684      0.844      0.912      0.934      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       7/10      1.67G     0.7353      2.692      1.904      1.155         13        640: 100%|██████████| 1/1 [00:00<00:00,  9.19it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.59it/s]
                   all          4         17      0.889      0.911      0.942      0.684      0.889      0.911      0.934      0.602

2024-05-24 09:13:52,195 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       8/10      1.67G      1.062      2.927      2.253      1.209         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.85it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.27it/s]
                   all          4         17      0.846      0.913      0.942      0.674      0.846      0.913      0.934      0.587

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       9/10      1.65G      1.024      2.655        2.2       1.32         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.58it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 23.05it/s]
                   all          4         17      0.873      0.915      0.942      0.675      0.873      0.915      0.934      0.587

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
      10/10      1.65G     0.6717      1.964      1.453      1.056         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.38it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.08it/s]
                   all          4         17      0.863      0.925      0.942      0.687      0.863      0.925      0.934      0.603

2024-05-24 09:13:54,152 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.

10 epochs completed in 0.002 hours.
2024-05-24 09:13:54,537 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/last.pt, 23.9MB
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt, 23.9MB

Validating /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
YOLOv8s-seg summary (fused): 195 layers, 11810560 parameters, 0 gradients, 42.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 26.21it/s]
2024-05-24 09:13:55,403 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
                   all          4         17      0.863      0.924      0.942      0.671      0.863      0.924      0.934      0.587
                person          4         10      0.844      0.547      0.678      0.339      0.844      0.547      0.629      0.308
                   dog          4          1      0.737          1      0.995      0.895      0.737          1      0.995      0.895
                 horse          4          2      0.903          1      0.995       0.65      0.903          1      0.995      0.226
              elephant          4          2      0.946          1      0.995      0.448      0.946          1      0.995        0.4
              umbrella          4          1       0.75          1      0.995      0.895       0.75          1      0.995      0.895
          potted plant          4          1          1          1      0.995      0.796          1          1      0.995      0.796

2024-05-24 09:13:57,840 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:13:58,935 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Speed: 0.1ms preprocess, 2.2ms inference, 0.0ms loss, 0.8ms postprocess per image
Results saved to /home/burhan/tests/ultralytics/runs/segment/train6
2024-05-24 09:14:00,430 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Ultralytics HUB: Syncing final model...
2024-05-24 09:14:01,795 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:14:02,201 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
 59%|█████▉    | 13.5M/22.8M [00:00<00:00, 15.2MB/s]
2024-05-24 09:14:04,112 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
100%|██████████| 22.8M/22.8M [00:01<00:00, 14.2MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/model.py", line 660, in train
    self.trainer.train()
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 205, in train
    self._do_train(world_size)
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 468, in _do_train
    self.run_callbacks("on_train_end")
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 165, in run_callbacks
    callback(self)
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/utils/callbacks/hub.py", line 69, in on_train_end
    LOGGER.info(f"{PREFIX}Done ✅\n" f"{PREFIX}View model at {session.model_url} 🚀")
AttributeError: 'HUBTrainingSession' object has no attribute 'model_url'
>>> 2024-05-24 09:14:08,415 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
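
The traceback above ends in the `on_train_end` callback reading `session.model_url` from a session on which that attribute was never set, presumably because model creation failed earlier in the run. A minimal sketch of a defensive lookup follows. It uses a stand-in class rather than the real `HUBTrainingSession`, and `FakeSession` and `describe_model_url` are hypothetical names introduced only for illustration.

```python
class FakeSession:
    """Stand-in for a session whose model creation failed, so model_url was never set."""


def describe_model_url(session):
    """Build the final log message without raising when model_url is missing."""
    url = getattr(session, "model_url", None)  # None if the attribute does not exist
    if url is None:
        return "Model was not created on HUB, nothing to view"
    return f"View model at {url}"


print(describe_model_url(FakeSession()))
```

This only avoids the secondary `AttributeError`; it does not address why the model was never created.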

@Laughing-q
Member Author

I attempted launching a local training while logged into my HUB account (both with and without DDP), but the HUB logging doesn't appear to work for either case on this branch.

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)

No model was uploaded after training completed.

@glenn-jocher @Burhan-Q @sergiuwaxmann I think I found a bug in the hub-sdk package that affects HUB logging for local training.
HUB logging does not work properly when cache != "ram" here:

"cache": model_args.get("cache", "ram"),

This means HUB logging fails whenever we pass an argument that overrides cache (True or False). For example, launching a local training on the main branch with the following script produces no logging on HUB:

from ultralytics import YOLO, hub
hub.login(API_KEY)
model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, cache=True)

Meanwhile, it emits this HUB error log:

2024-05-24 22:06:46,208 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 22:06:46,210 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model

This matters here because in this PR I changed the call so that self.args is passed to self.hub_session.create_model:

self.hub_session.create_model(self.args)

By default, cache is False in self.args, hence no HUB logging.
It seems to me the root cause is the cache handling in the hub-sdk package.
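
To make the behaviour described above concrete, here is a small sketch of how the quoted `model_args.get("cache", "ram")` line resolves for different inputs. It uses plain dictionaries only; `build_payload` is a hypothetical helper, and whether the server actually rejects values other than "ram" is the hypothesis stated in this comment, not something this snippet can confirm.

```python
def build_payload(model_args):
    """Mirror the quoted line: the "ram" default applies only when "cache" is absent."""
    return {"cache": model_args.get("cache", "ram")}


print(build_payload({}))                # {'cache': 'ram'}
print(build_payload({"cache": False}))  # {'cache': False}
print(build_payload({"cache": True}))   # {'cache': True}
```

Because `dict.get` falls back to the default only for a missing key, any explicit value (including False) is forwarded unchanged.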

@Laughing-q
Member Author

Laughing-q commented May 24, 2024

I could easily fix the issue in this PR by excluding cache from self.args before passing it to self.hub_session.create_model, but that would not address the root cause. 🤔
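
The workaround mentioned above could look roughly like the sketch below. It assumes the training arguments are available as a plain dictionary (the real `self.args` object may need converting first), and `without_keys` is a hypothetical helper, not part of ultralytics or hub-sdk.

```python
def without_keys(args, excluded=("cache",)):
    """Return a copy of args with the excluded keys dropped; the input is not modified."""
    return {k: v for k, v in args.items() if k not in excluded}


train_args = {"epochs": 10, "batch": 16, "cache": False}
print(without_keys(train_args))  # {'epochs': 10, 'batch': 16}
```

With the key removed, a downstream `.get("cache", "ram")` would fall back to its default again, which is why this hides the symptom rather than fixing it.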

@glenn-jocher
Member

@Laughing-q @Burhan-Q @sergiuwaxmann strange. I think this PR might need some extra study; if we rush a solution we might just end up with more bugs. What we need is a solution that logs to HUB correctly when training in both of these ways:

  • Starting from HUB (single-GPU and DDP)
  • Starting from ultralytics after logging in to HUB with yolo hub login (single and DDP)

We should really strive to implement the logging identically to W&B, which works correctly with our callbacks in all scenarios.

I don't have much time today but I will look into this this weekend.

@glenn-jocher glenn-jocher added the TODO Items that needs completing label May 26, 2024
@Laughing-q
Member Author

Laughing-q commented Jun 23, 2024

@glenn-jocher I resolved the conflicts and eliminated the hub_model_url. I tested all the cases (training locally or from HUB, in both single-GPU and DDP mode) and everything works properly.
Now this PR looks much cleaner. Thanks for the refactoring PR!

FYI, I used to think that updating trainer.args.model to model_url directly, without introducing a new attribute, would cause issues with model initialization. It turns out it doesn't: the value of trainer.args.model has already been recorded in trainer.model during Trainer initialization, so we are free to reuse it for model_url.
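
The reasoning above can be illustrated with a toy example. `ToyTrainer` is a hypothetical stand-in, not the ultralytics trainer, and the URL is a placeholder; the only point is that a value copied at initialization is unaffected by a later reassignment of the attribute it was copied from.

```python
from types import SimpleNamespace


class ToyTrainer:
    """Toy stand-in: copies args.model into self.model once, at construction time."""

    def __init__(self, args):
        self.args = args
        self.model = args.model  # captured here; later changes to args.model do not propagate


trainer = ToyTrainer(SimpleNamespace(model="yolov8s-pose.pt"))
trainer.args.model = "https://hub.example.invalid/models/MODEL_ID"  # placeholder URL

print(trainer.model)       # yolov8s-pose.pt
print(trainer.args.model)  # https://hub.example.invalid/models/MODEL_ID
```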

@glenn-jocher
Member

@Laughing-q wow this is great, much simpler!

@glenn-jocher glenn-jocher removed the TODO Items that needs completing label Jun 23, 2024
@glenn-jocher glenn-jocher changed the title Attempt to fix HUB session with DDP training ultralytics 8.2.41 fix HUB session with DDP training Jun 23, 2024
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher
Member

@Laughing-q I'm getting errors when training DDP from HUB to local.

I see there are dataset download issues as well: the download appears to happen twice, so I think we need to autodownload datasets only on RANK -1 and 0, but that might be a separate issue.

Screenshot 2024-06-23 at 18 51 28
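
The idea of downloading only on RANK -1 or 0 can be sketched as a simple gate. This is an illustration under assumptions, not the code merged in this PR: it reads the rank from the `RANK` environment variable (treating an unset variable as -1, i.e. non-distributed), and `is_main_process` and `maybe_download` are hypothetical names.

```python
import os


def is_main_process(rank=None):
    """True for single-process runs (rank -1) and for rank 0 of a distributed job."""
    if rank is None:
        rank = int(os.getenv("RANK", -1))  # assumption: launcher exports RANK
    return rank in {-1, 0}


def maybe_download(download_fn, rank=None):
    """Call download_fn only on the main process; return whether it was called."""
    if is_main_process(rank):
        download_fn()
        return True
    return False


calls = []
for r in (-1, 0, 1, 2):
    maybe_download(lambda: calls.append("downloaded"), rank=r)
print(len(calls))  # 2
```

In a real DDP job the other ranks would also need to wait until the download finishes, for example with `torch.distributed.barrier()`, before reading the dataset.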

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
@glenn-jocher
Member

@Laughing-q I've been testing this some more and something strange is happening on dataset download from HUB (both single and multi-GPU): coco8 is not unzipping correctly to ../datasets/coco8; it's unzipping to ../datasets/coco8/coco8. I'm going to merge this PR and try to figure out what's happening with the dataset unzip issue.
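
The cause of the nested directory is not identified in this thread. One plausible source of a `coco8/coco8` layout, offered purely as an assumption, is extracting an archive that already wraps its contents in a `coco8/` folder into a target directory that is also named `coco8`. The sketch below shows how that case could be detected with the standard `zipfile` module; `extract_flat` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_flat(zip_path, datasets_dir):
    """Extract so the dataset ends up at datasets_dir/<zip stem>, never one level deeper."""
    zip_path, datasets_dir = Path(zip_path), Path(datasets_dir)
    with zipfile.ZipFile(zip_path) as zf:
        # Top-level entries of the archive, e.g. {"coco8"} when everything is wrapped
        roots = {name.split("/")[0] for name in zf.namelist()}
        wrapped = roots == {zip_path.stem}
        # A wrapped archive goes into the parent; otherwise create the folder ourselves
        target = datasets_dir if wrapped else datasets_dir / zip_path.stem
        zf.extractall(target)
    return datasets_dir / zip_path.stem
```

Either way the returned path is `datasets_dir/coco8` for a `coco8.zip`, regardless of whether the archive carries its own top-level folder.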

@glenn-jocher glenn-jocher changed the title ultralytics 8.2.41 fix HUB session with DDP training Fix HUB session with DDP training Jun 23, 2024
@glenn-jocher glenn-jocher merged commit 1696024 into main Jun 23, 2024
12 of 13 checks passed
@glenn-jocher glenn-jocher deleted the hub-session branch June 23, 2024 18:00
Labels
bug Something isn't working HUB Ultralytics HUB issues