Fix HUB session with DDP training #13103

Laughing-q · 2024-05-24T12:05:28Z

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved distributed training and HUB session handling in the Ultralytics training workflow. 🛠️

📊 Key Changes

Added Distributed Training Synchronization: Introduced the torch_distributed_zero_first context manager to ensure all distributed processes wait for the main process to complete certain tasks.
Enhanced Model Loading: Incorporated the usage of torch_distributed_zero_first to prevent multiple auto-downloads of the dataset in distributed settings.
Initialized HUB Sessions: Added HUB session initialization in the trainer.
Adjusted HUB Authentication: Refined HUB model URL checking to ensure correct handling of non-HUB model URLs.
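
As a rough illustration of the rank-zero-first synchronization described in the key changes above, here is a minimal sketch that simulates ranks with threads. It is not the Ultralytics implementation, which relies on torch.distributed barriers; the names zero_first and run_ranks, and the in-memory "cache" standing in for a dataset on disk, are assumptions made for this example.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first(rank: int, barrier: threading.Barrier):
    """Let rank 0 run the wrapped block first; other ranks wait, then run it."""
    if rank != 0:
        barrier.wait()  # non-zero ranks block here until rank 0 has finished
    yield
    if rank == 0:
        barrier.wait()  # rank 0 arrives last and releases the waiting ranks


def run_ranks(world_size: int = 3) -> list:
    """Simulate `world_size` ranks that all need the same dataset."""
    barrier = threading.Barrier(world_size)
    cache = {}       # hypothetical stand-in for a dataset directory
    downloads = []   # records which ranks actually "downloaded"
    lock = threading.Lock()

    def worker(rank: int) -> None:
        with zero_first(rank, barrier):
            with lock:
                if "dataset" not in cache:  # only true for the first arrival
                    downloads.append(rank)
                    cache["dataset"] = "coco128"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return downloads
```

Because every other rank is held at the barrier until rank 0 leaves the block, only rank 0 performs the simulated download and the remaining ranks find the cached copy.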

🎯 Purpose & Impact

Efficiency in Distributed Training: Prevents redundant downloads, saving time and resources in distributed environments. 🌐
Stable Integrations: Ensures HUB sessions are properly authenticated and initialized, enhancing the reliability of model training logs. 🔒
Developer Experience: Simplifies the process for developers working in distributed training scenarios, making it more user-friendly and robust. 🧑‍💻

codecov · 2024-05-24T12:07:08Z

Codecov Report

Attention: Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 35.53%. Comparing base (6872028) to head (9719119).
Report is 1 commit behind head on main.

❗ Current head 9719119 differs from pull request most recent head 1cfcced

Please upload reports for the commit 1cfcced to get more accurate results.

| Files | Patch % | Lines |
|------|------|------|
| ultralytics/engine/trainer.py | 0.00% | 3 Missing ⚠️ |
| ultralytics/hub/session.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (6872028) and HEAD (9719119).

HEAD has 1 upload less than BASE

| Flag | BASE (6872028) | HEAD (9719119) |
|------|------|------|
| Benchmarks | 2 | 1 |

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #13103       +/-   ##
===========================================
- Coverage   70.40%   35.53%   -34.88%     
===========================================
  Files         124      124               
  Lines       15905    15886       -19     
===========================================
- Hits        11198     5645     -5553     
- Misses       4707    10241     +5534

| Flag | Coverage Δ | |
|------|------|------|
| Benchmarks | `35.53% <33.33%> (-0.26%)` | ⬇️ |
| GPU | `?` | |
| Tests | `?` | |


Laughing-q · 2024-05-24T12:09:40Z

@glenn-jocher I'm not very familiar with the overall workflow between HUB and ultralytics, but as far as I can tell a model can be loaded directly from HUB, so I kept some of the HUB session code in model.py to avoid breaking anything.

ultralytics/ultralytics/engine/model.py

Lines 136 to 140 in 654c37f

    
        if self.is_hub_model(model):
            # Fetch model from HUB
            checks.check_requirements("hub-sdk>=0.0.6")
            self.session = self._get_hub_session(model)
            model = self.session.model_file

https://github.com/ultralytics/ultralytics/blob/654c37f09bc3b1e9d182a6f4ea315616bf14c643/ultralytics/engine/model.py#L180-185
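
The branch quoted above depends on telling HUB model URLs apart from ordinary weight files. The sketch below shows one simple way such a check could look. It is an assumption for illustration only, not the library's is_hub_model: the helper name looks_like_hub_model_url is hypothetical, and the prefix is taken from the HUB model URLs that appear in this thread.

```python
from urllib.parse import urlparse

# Prefix observed in the HUB model URLs posted in this thread
HUB_MODELS_PREFIX = "https://hub.ultralytics.com/models/"


def looks_like_hub_model_url(model: str) -> bool:
    """Return True only for URLs under the HUB models prefix that carry a model ID."""
    if not model.startswith(HUB_MODELS_PREFIX):
        return False  # local weights, YAML files, and non-HUB URLs land here
    model_id = urlparse(model).path.rsplit("/", 1)[-1]
    return bool(model_id)  # reject the bare prefix with no ID after it


if __name__ == "__main__":
    for candidate in (
        "https://hub.ultralytics.com/models/oljNmlCqCllzTUL5Jwwj",
        "yolov8s-pose.pt",
        "https://example.com/models/abc",
    ):
        print(candidate, looks_like_hub_model_url(candidate))
```

A non-HUB URL fails the prefix test, so it would fall through to the normal model-loading path instead of triggering HUB session creation.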

Laughing-q · 2024-05-24T12:12:42Z

@glenn-jocher I also tested on my local multi-GPU machine and it seems to work properly, i.e. it tries to create a HUB session during DDP training. That's as far as I'm able to test, since I don't have a HUB environment or account (I'm guessing it's the same as wandb, which needs an account).
Waiting for @Burhan-Q to run some more tests. :)

glenn-jocher · 2024-05-24T12:28:30Z

Got it, thanks @Laughing-q! @Burhan-Q can you test this PR for DDP training from HUB and then also from Ultralytics to HUB?

Burhan-Q · 2024-05-24T12:36:26Z

@glenn-jocher @Laughing-q @sergiuwaxmann this worked with a model created from HUB

Here's the post-training printout of the training arguments.

model.session.train_args
>>> {
    'batch': -1, 
    'cache': 'ram', 
    'data': 'coco128.yaml', 
    'device': [0, 1],  # DDP enabled
    'epochs': 10, 
    'imgsz': 640, 
    'patience': 50, 
    'time': None
}

HUB DDP training report

Local log

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO('https://hub.ultralytics.com/models/oljNmlCqCllzTUL5Jwwj')
results = model.train()

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=detect, mode=train, model=yolov8s.pt, data=coco128.yaml, epochs=10, time=None, patience=50, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=ultralytics/runs/detect/train

                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    591360  ultralytics.nn.modules.block.C2f             [768, 256, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 16                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1   1969152  ultralytics.nn.modules.block.C2f             [768, 512, 1]                 
 22        [15, 18, 21]  1   2147008  ultralytics.nn.modules.head.Detect           [80, [128, 256, 512]]         
Model summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs

Transferred 355/355 items from pretrained weights

WARNING ⚠️ 'batch=-1' for AutoBatch is incompatible with Multi-GPU training, setting default 'batch=16'

DDP: debug command /home/burhan/ultra_repo/.ultra/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 54685 /home/burhan/.config/Ultralytics/DDP/_temp_o135mjif139927424095984.py

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)

Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀
Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
train: Caching images (0.1GB RAM): 100%|██████████| 128/128 [00:00<00:00, 1948.18it/s]
val: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Caching images (0.1GB RAM): 100%|██████████| 128/128 [00:00<00:00, 984.03it/s]
Plotting labels to ultralytics/runs/detect/train/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)

Image sizes 640 train, 640 val
Using 16 dataloader workers

Logging results to ultralytics/runs/detect/train

Starting training for 10 epochs...
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/10      2.37G      1.215      1.451      1.245         42        640: 100%|██████████| 8/8 [00:02<00:00,  3.45it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:01<00:00,  6.85it/s]
                   all        128        929      0.757      0.684       0.76      0.588

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       2/10      2.43G       1.21      1.482      1.245         48        640: 100%|██████████| 8/8 [00:00<00:00,  8.90it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.82it/s]
                   all        128        929      0.748      0.665      0.764       0.58

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       3/10      2.44G      1.122      1.112      1.147         48        640: 100%|██████████| 8/8 [00:00<00:00,  9.47it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.54it/s]
                   all        128        929      0.711      0.692      0.777      0.591

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       4/10      2.44G     0.9762     0.9867      1.098         62        640: 100%|██████████| 8/8 [00:00<00:00,  9.60it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.94it/s]
                   all        128        929      0.771       0.71      0.797      0.623

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       5/10      2.49G     0.9425     0.9654      1.071         68        640: 100%|██████████| 8/8 [00:00<00:00,  9.25it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 16.00it/s]
                   all        128        929      0.762      0.734      0.802      0.621

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       6/10      2.44G      1.026      0.899      1.084         31        640: 100%|██████████| 8/8 [00:00<00:00,  9.77it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.68it/s]
                   all        128        929      0.826      0.733      0.809      0.629

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       7/10      2.48G     0.8806     0.8252      1.058         45        640: 100%|██████████| 8/8 [00:00<00:00,  9.26it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.98it/s]
                   all        128        929      0.799      0.768      0.818      0.638

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       8/10      2.47G     0.8754      0.818      1.025         49        640: 100%|██████████| 8/8 [00:00<00:00,  9.14it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.87it/s]
                   all        128        929      0.844      0.722      0.827      0.649

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       9/10      2.49G     0.9863     0.8815      1.152         36        640: 100%|██████████| 8/8 [00:00<00:00,  9.62it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.99it/s]
                   all        128        929      0.883       0.72      0.836      0.663

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      10/10      2.45G       0.98     0.8182      1.072         39        640: 100%|██████████| 8/8 [00:00<00:00, 10.12it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:00<00:00, 15.65it/s]
                   all        128        929      0.867      0.736      0.841      0.666

10 epochs completed in 0.006 hours.
Optimizer stripped from ultralytics/runs/detect/train/weights/last.pt, 22.6MB
Optimizer stripped from ultralytics/runs/detect/train/weights/best.pt, 22.6MB

Validating ultralytics/runs/detect/train/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
                                                            CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
Model summary (fused): 168 layers, 11156544 parameters, 0 gradients, 28.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:02<00:00,  3.27it/s]
                   all        128        929      0.869      0.738      0.841      0.666
                person        128        254      0.964      0.632      0.853      0.652
               bicycle        128          6      0.801      0.333      0.537      0.312
                   car        128         46          1      0.306      0.591      0.311
            motorcycle        128          5      0.917          1      0.995       0.89
              airplane        128          6      0.964          1      0.995      0.907
                   bus        128          7          1      0.791      0.995      0.883
                 train        128          3      0.962          1      0.995      0.808
                 truck        128         12      0.961        0.5       0.69      0.434
                  boat        128          6      0.878      0.667      0.791      0.522
         traffic light        128         14          1      0.249      0.356      0.285
             stop sign        128          2      0.898          1      0.995      0.848
                 bench        128          9          1      0.733      0.833      0.603
                  bird        128         16          1      0.881      0.995      0.707
                   cat        128          4      0.926          1      0.995      0.891
                   dog        128          9      0.891      0.908      0.984      0.819
                 horse        128          2      0.896          1      0.995       0.75
              elephant        128         17          1      0.923      0.955      0.795
                  bear        128          1      0.695          1      0.995      0.895
                 zebra        128          4      0.936          1      0.995      0.959
               giraffe        128          9          1      0.956      0.995       0.85
              backpack        128          6          1      0.703      0.837      0.594
              umbrella        128         18      0.821      0.889       0.95      0.704
               handbag        128         19      0.821      0.421      0.598       0.44
                   tie        128          7          1      0.802      0.858      0.618
              suitcase        128          4          1      0.813      0.995      0.616
               frisbee        128          5       0.89        0.8      0.804      0.684
                  skis        128          1      0.781          1      0.995      0.796
             snowboard        128          7      0.569      0.714      0.763      0.579
           sports ball        128          6          1      0.554       0.67      0.442
                  kite        128         10       0.89        0.3      0.583      0.296
          baseball bat        128          4      0.668       0.25      0.565       0.43
        baseball glove        128          7      0.935      0.429      0.439        0.3
            skateboard        128          5      0.623          1      0.938       0.59
         tennis racket        128          7      0.745      0.571      0.607      0.401
                bottle        128         18          1      0.353      0.752      0.499
            wine glass        128         16          1      0.485      0.703      0.501
                   cup        128         36      0.849      0.806      0.857       0.58
                  fork        128          6      0.802      0.333      0.644      0.478
                 knife        128         16      0.856      0.625       0.79      0.584
                 spoon        128         22        0.9      0.411      0.636      0.484
                  bowl        128         28        0.9      0.821      0.872      0.707
                banana        128          1      0.732          1      0.995      0.995
              sandwich        128          2      0.863          1      0.995      0.995
                orange        128          4      0.556       0.75      0.702      0.524
              broccoli        128         11      0.616      0.295      0.558      0.382
                carrot        128         24      0.838      0.647      0.874      0.647
               hot dog        128          2      0.876          1      0.995      0.995
                 pizza        128          5      0.967          1      0.995      0.904
                 donut        128         14      0.695          1      0.936      0.862
                  cake        128          4      0.944          1      0.995      0.905
                 chair        128         35      0.765      0.559      0.742      0.545
                 couch        128          6      0.816      0.749      0.852      0.708
          potted plant        128         14      0.873      0.929      0.955      0.781
                   bed        128          3      0.908          1      0.995       0.94
          dining table        128         13          1      0.757      0.854      0.738
                toilet        128          2      0.965          1      0.995      0.896
                    tv        128          2      0.898          1      0.995      0.799
                laptop        128          3      0.932          1      0.995      0.907
                 mouse        128          2      0.588        0.5      0.545      0.413
                remote        128          8          1      0.638      0.861      0.662
            cell phone        128          8          1      0.542       0.63      0.442
             microwave        128          3       0.74          1      0.995      0.952
                  oven        128          5       0.76        0.6      0.665      0.479
                  sink        128          6      0.901        0.5      0.783      0.658
          refrigerator        128          5      0.935          1      0.995        0.8
                  book        128         29      0.708      0.241      0.625      0.418
                 clock        128          9       0.89        0.9      0.973      0.818
                  vase        128          2      0.631          1      0.995      0.995
              scissors        128          1          1          0      0.995      0.219
            teddy bear        128         21      0.838       0.81      0.862      0.625
            toothbrush        128          5      0.734          1      0.995      0.836

Speed: 0.1ms preprocess, 3.3ms inference, 0.0ms loss, 3.1ms postprocess per image
Results saved to ultralytics/runs/detect/train

Ultralytics HUB: Syncing final model...
100%|██████████| 21.5M/21.5M [00:01<00:00, 12.3MB/s]
Ultralytics HUB: Done ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀

Burhan-Q · 2024-05-24T12:37:58Z

Note

This PR is related to ultralytics/hub#695 and ultralytics/hub#606

Burhan-Q · 2024-05-24T12:42:27Z

I attempted launching a local training while logged into my HUB account (both with and without DDP), but the HUB logging doesn't appear to work for either case on this branch.

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)

No model was uploaded after training completed.

ultralytics/engine/trainer.py

Laughing-q · 2024-05-24T13:07:02Z


Does this work properly in single-GPU mode on the main branch?

Burhan-Q · 2024-05-24T13:11:38Z

@Laughing-q yes it does work when I switch to main

ultralytics/engine/trainer.py

Laughing-q · 2024-05-24T13:22:25Z

@Burhan-Q that's strange... for me the training reaches the step of creating the HUB instance.

Again, that's as far as I'm able to debug.
@Burhan-Q could you add some print statements in the function here to check whether self.hub_session is created successfully?
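
One possible shape for that kind of check is sketched below. The helper describe_hub_session is hypothetical, and the attribute names client.authenticated and model are taken from the trainer snippet posted later in this thread rather than from a verified API; the stub objects only stand in for a real trainer.

```python
from types import SimpleNamespace


def describe_hub_session(trainer) -> str:
    """Summarize the state of trainer.hub_session for debugging output."""
    session = getattr(trainer, "hub_session", None)
    if session is None:
        return "hub_session is None"
    if not session.client.authenticated:
        return "hub_session exists but client is not authenticated"
    if not session.model:
        return "hub_session is authenticated but no model was created"
    return "hub_session is ready"


if __name__ == "__main__":
    # Stub trainer objects used only to exercise the helper
    ready = SimpleNamespace(
        hub_session=SimpleNamespace(
            client=SimpleNamespace(authenticated=True), model=object()
        )
    )
    missing = SimpleNamespace(hub_session=None)
    print(describe_hub_session(ready))
    print(describe_hub_session(missing))
```

Printing this once right after session setup, and again just before the final upload, would show whether the session is lost at creation time or later in the run.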

Laughing-q · 2024-05-24T13:24:17Z

oh I was supposed to post this one

Burhan-Q · 2024-05-24T14:18:29Z

I tested out a modification to the trainer._setup_hub() method that gave me some interesting results

        if SETTINGS["hub"] and self.hub_session is None:
            # Create a model in HUB
            try:
                from ultralytics.hub.session import HUBTrainingSession

                session = HUBTrainingSession(self.args.model)
                self.hub_session = session if session.client.authenticated else self.hub_session
                if self.hub_session:
                    self.hub_session.create_model(self.args)
                    # Check model was created
                    if not self.hub_session.model:
                        self.hub_session = None
            except (PermissionError, ModuleNotFoundError):
                # Ignore PermissionError and ModuleNotFoundError which indicates hub-sdk not installed
                pass

With these changes, I suddenly get lots of these in the training log

hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
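
To make the control flow of the modified snippet easier to follow, here is a small simulation with a stub session. FakeSession and setup_hub are hypothetical names for this sketch and do not call hub-sdk; the branching mirrors the snippet above, under the assumption that a failed create_model leaves session.model empty.

```python
from types import SimpleNamespace


class FakeSession:
    """Stub standing in for a HUB training session (no network access)."""

    def __init__(self, authenticated: bool, server_responds: bool):
        self.client = SimpleNamespace(authenticated=authenticated)
        self.model = None
        self._server_responds = server_responds

    def create_model(self, args) -> None:
        # Assumption: the model attribute is only populated when the server answers
        if self._server_responds:
            self.model = SimpleNamespace(id="hypothetical-model-id")


def setup_hub(session, args=None):
    """Mirror the branching of the modified trainer snippet."""
    hub_session = session if session.client.authenticated else None
    if hub_session:
        hub_session.create_model(args)
        if not hub_session.model:  # creation failed, so drop the session
            hub_session = None
    return hub_session


if __name__ == "__main__":
    print(setup_hub(FakeSession(True, True)) is not None)
    print(setup_hub(FakeSession(True, False)) is None)
    print(setup_hub(FakeSession(False, True)) is None)
```

Under these assumptions, an authenticated session whose model creation gets no server response ends up as None, so training continues without HUB logging, which is consistent with the errors followed by a normal training run in the log below.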

Train log

from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-seg.pt")
result = model.train(data="coco8-seg.yaml", epochs=10, device=3)

Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=segment, mode=train, model=yolov8s-seg.pt, data=coco8-seg.yaml, epochs=10, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=3, workers=8, project=None, name=train6, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/home/burhan/tests/ultralytics/runs/segment/train6

                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    591360  ultralytics.nn.modules.block.C2f             [768, 256, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 16                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1   1969152  ultralytics.nn.modules.block.C2f             [768, 512, 1]                 
 22        [15, 18, 21]  1   2801504  ultralytics.nn.modules.head.Segment          [80, 32, 128, [128, 256, 512]]
YOLOv8s-seg summary: 261 layers, 11821056 parameters, 11821040 gradients, 42.9 GFLOPs

Transferred 417/417 items from pretrained weights
2024-05-24 09:13:41,106 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 09:13:41,109 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model.
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco8-seg/labels/train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
val: Scanning /home/shared/datasets/coco8-seg/labels/val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
Plotting labels to /home/burhan/tests/ultralytics/runs/segment/train6/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 66 weight(decay=0.0), 77 weight(decay=0.0005), 76 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/burhan/tests/ultralytics/runs/segment/train6
Starting training for 10 epochs...
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       1/10      1.52G     0.9476      2.702      1.978      1.283         13        640: 100%|██████████| 1/1 [00:00<00:00,  1.31it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  4.19it/s]
                   all          4         17      0.822      0.898       0.94      0.679      0.822      0.898      0.939      0.592

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       2/10      1.54G     0.9262      2.695      2.596      1.247         13        640: 100%|██████████| 1/1 [00:00<00:00,  9.71it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 14.64it/s]
                   all          4         17      0.832      0.905      0.941       0.68      0.832      0.905      0.941        0.6

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       3/10      1.57G     0.8526      2.626      2.046      1.234         13        640: 100%|██████████| 1/1 [00:00<00:00,  8.52it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 17.86it/s]
                   all          4         17      0.838      0.913      0.942       0.68      0.838      0.913      0.935      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       4/10      1.57G      1.143      3.088       2.44      1.395         13        640: 100%|██████████| 1/1 [00:00<00:00,  8.65it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 23.26it/s]
                   all          4         17      0.834      0.913       0.94      0.672      0.834      0.913      0.939      0.601

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       5/10      1.55G      1.009      2.831       2.98      1.271         13        640: 100%|██████████| 1/1 [00:00<00:00,  3.74it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95):   0%|          | 0/1 [00:00<?, ?it/s]2024-05-24 09:13:50,859 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 17.41it/s]
                   all          4         17      0.836      0.912      0.941      0.683      0.836      0.912      0.932      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       6/10      1.59G      1.268      3.083       2.17      1.654         13        640: 100%|██████████| 1/1 [00:00<00:00,  5.34it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.36it/s]
                   all          4         17      0.844      0.912      0.942      0.684      0.844      0.912      0.934      0.599

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       7/10      1.67G     0.7353      2.692      1.904      1.155         13        640: 100%|██████████| 1/1 [00:00<00:00,  9.19it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.59it/s]
                   all          4         17      0.889      0.911      0.942      0.684      0.889      0.911      0.934      0.602

2024-05-24 09:13:52,195 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       8/10      1.67G      1.062      2.927      2.253      1.209         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.85it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.27it/s]
                   all          4         17      0.846      0.913      0.942      0.674      0.846      0.913      0.934      0.587

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       9/10      1.65G      1.024      2.655        2.2       1.32         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.58it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 23.05it/s]
                   all          4         17      0.873      0.915      0.942      0.675      0.873      0.915      0.934      0.587

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
      10/10      1.65G     0.6717      1.964      1.453      1.056         13        640: 100%|██████████| 1/1 [00:00<00:00,  7.38it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 22.08it/s]
                   all          4         17      0.863      0.925      0.942      0.687      0.863      0.925      0.934      0.603

2024-05-24 09:13:54,152 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.

10 epochs completed in 0.002 hours.
2024-05-24 09:13:54,537 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/last.pt, 23.9MB
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt, 23.9MB

Validating /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
YOLOv8s-seg summary (fused): 195 layers, 11810560 parameters, 0 gradients, 42.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 26.21it/s]
2024-05-24 09:13:55,403 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
                   all          4         17      0.863      0.924      0.942      0.671      0.863      0.924      0.934      0.587
                person          4         10      0.844      0.547      0.678      0.339      0.844      0.547      0.629      0.308
                   dog          4          1      0.737          1      0.995      0.895      0.737          1      0.995      0.895
                 horse          4          2      0.903          1      0.995       0.65      0.903          1      0.995      0.226
              elephant          4          2      0.946          1      0.995      0.448      0.946          1      0.995        0.4
              umbrella          4          1       0.75          1      0.995      0.895       0.75          1      0.995      0.895
          potted plant          4          1          1          1      0.995      0.796          1          1      0.995      0.796

2024-05-24 09:13:57,840 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:13:58,935 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Speed: 0.1ms preprocess, 2.2ms inference, 0.0ms loss, 0.8ms postprocess per image
Results saved to /home/burhan/tests/ultralytics/runs/segment/train6
2024-05-24 09:14:00,430 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Ultralytics HUB: Syncing final model...
2024-05-24 09:14:01,795 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:14:02,201 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
 59%|█████▉    | 13.5M/22.8M [00:00<00:00, 15.2MB/s]2024-05-24 09:14:04,112 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
100%|██████████| 22.8M/22.8M [00:01<00:00, 14.2MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/model.py", line 660, in train
    self.trainer.train()
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 205, in train
    self._do_train(world_size)
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 468, in _do_train
    self.run_callbacks("on_train_end")
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 165, in run_callbacks
    callback(self)
  File "/home/burhan/ultra_repo/ultralytics/ultralytics/utils/callbacks/hub.py", line 69, in on_train_end
    LOGGER.info(f"{PREFIX}Done ✅\n" f"{PREFIX}View model at {session.model_url} 🚀")
AttributeError: 'HUBTrainingSession' object has no attribute 'model_url'
>>> 2024-05-24 09:14:08,415 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.

Laughing-q · 2024-05-24T14:30:36Z

I tried launching a local training run while logged in to my HUB account (both with and without DDP), but HUB logging doesn't appear to work in either case on this branch.
from ultralytics import YOLO, hub

hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True

model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)
No model was uploaded after training completed.

@glenn-jocher @Burhan-Q @sergiuwaxmann I think I found a bug in the hub-sdk package that affects HUB logging for local training.
HUB logging does not work properly when cache != "ram" here:

ultralytics/ultralytics/hub/session.py

Line 96 in 654c37f

"cache": model_args.get("cache", "ram"),

This means HUB logging fails whenever we pass an argument that overrides cache (True or False). For example, launching a local training run on the main branch with the following script produces no logging on HUB:

from ultralytics import YOLO, hub
hub.login(API_KEY)
model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, cache=True)

Meanwhile, it logs these HUB errors:

2024-05-24 22:06:46,208 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 22:06:46,210 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model

Given this, note that in this PR I changed the call so that self.args is passed to self.hub_session.create_model:

ultralytics/ultralytics/engine/trainer.py

Line 785 in 570f894

self.hub_session.create_model(self.args)

By default, cache is False in self.args, hence no HUB logging.
It seems to me the root cause is the cache handling in the hub-sdk package.

Laughing-q · 2024-05-24T14:44:59Z

I could easily fix the issue in this PR by excluding cache from self.args before passing it to self.hub_session.create_model, but to me that wouldn't address the root issue. 🤔
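For illustration, the workaround described above can be sketched in plain Python. This is a minimal sketch and not the code in this PR or in hub-sdk: `build_payload_cache` and `drop_cache` are hypothetical names, and the only behavior taken from the thread is the quoted `model_args.get("cache", "ram")` line and the default of `cache=False` in the trainer args.

```python
# Hypothetical helpers for illustration; not part of ultralytics or hub-sdk.


def build_payload_cache(model_args: dict):
    """Mirror the quoted line: fall back to "ram" only when no cache key exists."""
    return model_args.get("cache", "ram")


def drop_cache(train_args: dict) -> dict:
    """Return a copy of the training args without the cache key."""
    return {k: v for k, v in train_args.items() if k != "cache"}


# Assumed example args; cache=False is the trainer default per the comment above.
train_args = {"epochs": 10, "cache": False}

print(build_payload_cache(train_args))              # the override is forwarded as-is
print(build_payload_cache(drop_cache(train_args)))  # the "ram" fallback applies
```

Dropping the key only restores the fallback value; it does not explain why the server rejects other cache values, which is why it would not address the root issue.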

glenn-jocher · 2024-05-24T17:46:06Z

@Laughing-q @Burhan-Q @sergiuwaxmann Strange. I think this PR might need some extra study; if we rush a solution we might just end up with more bugs. What we need is a solution that logs to HUB correctly when training in both of these ways:

Starting from HUB (single-GPU and DDP)
Starting from ultralytics after logging in to HUB with yolo hub login (single-GPU and DDP)

We should really strive to implement the logging identically to W&B, which works correctly with our callbacks in all scenarios.

I don't have much time today, but I will look into it this weekend.

Laughing-q · 2024-06-23T15:01:50Z

@glenn-jocher I resolved the conflicts and eliminated hub_model_url. I tested all the cases (training locally or from HUB, in both single-GPU and DDP mode) and everything works properly.
Now this PR looks much cleaner. Thanks for the refactoring PR!

FYI, I previously thought that updating trainer.args.model to model_url directly, without introducing a new attribute, would cause issues with model initialization. It turns out it doesn't: the value of trainer.args.model has already been recorded in trainer.model during Trainer initialization, so we're free to reuse it for model_url.

glenn-jocher · 2024-06-23T15:53:10Z

@Laughing-q wow this is great, much simpler!

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>

glenn-jocher · 2024-06-23T16:53:43Z

@Laughing-q I'm getting errors when training DDP from HUB to local.

I also see dataset download issues: the download appears to happen twice, so I think we should autodownload datasets only on RANK -1 and 0, but that might be a separate issue.
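A rank-gated download of the kind suggested above can be sketched as follows. This is a minimal sketch under stated assumptions, not the ultralytics implementation: `zero_first` and `should_autodownload` are hypothetical names, and the barrier is passed in as a plain callable so the ordering can be shown without a process group (in a real DDP run it would be something like `torch.distributed.barrier`).

```python
from contextlib import contextmanager
from typing import Callable, Iterator


def should_autodownload(rank: int) -> bool:
    """Only the single-process case (-1) and the main DDP process (0) download."""
    return rank in {-1, 0}


@contextmanager
def zero_first(rank: int, barrier: Callable[[], None]) -> Iterator[None]:
    """Let rank 0 run the body first while the other ranks wait at a barrier."""
    if rank not in {-1, 0}:
        barrier()  # non-main ranks block here until rank 0 reaches its barrier
    yield
    if rank == 0:
        barrier()  # rank 0 is done, so it releases the waiting ranks


def run(rank: int) -> list:
    """Record the order of events for one simulated rank."""
    events = []
    with zero_first(rank, lambda: events.append("barrier")):
        events.append("download" if should_autodownload(rank) else "reuse")
    return events


print(run(0))   # rank 0 downloads, then hits the barrier
print(run(1))   # rank 1 waits at the barrier, then reuses the dataset
print(run(-1))  # single-process runs never touch the barrier
```

The gate and the barrier address different problems: the gate prevents duplicate downloads, while the barrier keeps the other ranks from reading the dataset before rank 0 has finished writing it.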

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com> Co-authored-by: UltralyticsAssistant <web@ultralytics.com>

glenn-jocher · 2024-06-23T17:59:36Z

@Laughing-q I've been testing this some more and something strange is happening on dataset download from HUB (both single-GPU and multi-GPU): coco8 is not unzipping to ../datasets/coco8 as expected but to ../datasets/coco8/coco8. I'm going to merge this PR and try to figure out what's happening with the dataset unzip issue.

update

a18fc62

clean import

570f894

Burhan-Q added the bug and HUB labels May 24, 2024

Burhan-Q reviewed May 24, 2024

ultralytics/engine/trainer.py (outdated)

Burhan-Q reviewed May 24, 2024

ultralytics/engine/trainer.py (outdated)

Merge branch 'main' into hub-session

e774613

glenn-jocher added 3 commits May 24, 2024 19:46

Merge branch 'main' into hub-session

3445b2e

Merge branch 'main' into hub-session

565430f

Merge branch 'main' into hub-session

63d6b9c

glenn-jocher added the TODO label May 26, 2024

glenn-jocher and others added 4 commits May 27, 2024 17:32

Glenn removed extra session creation

ac26ee4

Merge branch 'main' into hub-session

a66f1c7

Merge branch 'main' into hub-session

1dbd2a6

Merge branch 'main' into hub-session

8aeb68c

Laughing-q added 2 commits June 23, 2024 22:09

reslove conflicts

69f995b

fix generate_ddp_file

1fde1cf

Laughing-q force-pushed the hub-session branch from 2c97e13 to 1fde1cf Compare June 23, 2024 14:14

UltralyticsAssistant and others added 8 commits June 23, 2024 14:15

Auto-format by https://ultralytics.com/actions

2f95681

update

f6f37e1

update

956b7c4

update

c259360

clean up

185e29f

Auto-format by https://ultralytics.com/actions

f89fa28

attempt to eliminate hub_model_url

2199873

update

e2668ef

Update hub.py

1b3c4b1

glenn-jocher added 3 commits June 23, 2024 17:57

Merge branch 'main' into hub-session

b11a4bb

Update session.py

dcf139f

Update __init__.py

c27b2f4

glenn-jocher removed the TODO label Jun 23, 2024

glenn-jocher changed the title ~~Attempt to fix HUB session with DDP training~~ ultralytics 8.2.41 fix HUB session with DDP training Jun 23, 2024

glenn-jocher added 4 commits June 23, 2024 18:39

Refactor create_session

ed018fe

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>

Merge remote-tracking branch 'origin/hub-session' into hub-session

5ef66c0

Refactor create_session

e39f333

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>

Refactor create_session

08ceeaf

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>

DDP autodownload fix attempt (#13910)

9719119

Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com> Co-authored-by: UltralyticsAssistant <web@ultralytics.com>

Update __init__.py

1cfcced

glenn-jocher changed the title ~~ultralytics 8.2.41 fix HUB session with DDP training~~ Fix HUB session with DDP training Jun 23, 2024

glenn-jocher merged commit 1696024 into main Jun 23, 2024
12 of 13 checks passed

glenn-jocher deleted the hub-session branch June 23, 2024 18:00
