
Fix overrides training cfg bug #10002

Merged: 10 commits from fix_overrides into ultralytics:main on Apr 14, 2024
Conversation

@DseidLi (Contributor) commented on Apr 12, 2024

Pull Request Overview

This PR addresses a bug related to configuration overrides in the training code of the YOLO model within the ultralytics framework. The issue was identified during routine training operations and stems from the improper handling of the overrides dictionary, which could potentially lead to missing critical model parameters during training sessions.

Background

While working with the YOLO model, instantiated from the ultralytics.engine.model class, I observed that model parameters could be lost during the configuration merging process. This realization came about during a deep dive into how the ultralytics.engine.model module initializes and loads models.

Issue Description

Here is a succinct breakdown of the problematic behavior observed:

  1. Model Initialization:

    • Models are typically loaded using model = YOLO('models/pretrained/yolov8s-cls.pt'), which relies on the underlying mechanism provided by ultralytics/engine/model.py.
    • The initializer creates a new YOLO model if the file extension is .yaml or .yml, and otherwise loads an existing model via methods like _load.
  2. Model Configuration Handling:

    • During initialization (_load), overrides, which include important model parameters like the model path and task, are set.
    • However, if a new configuration is supplied (cfg parameter), the existing overrides dictionary is completely replaced by the new configuration without merging the important existing parameters.
  3. Consequences:

    • This replacement causes the loss of crucial model parameters, such as the model path (model) and the task type (task), which are vital for the model’s training operations.
    • As a result, subsequent training operations do not have access to these essential parameters, leading to potential failures or incorrect behaviors.

Solution

The proposed fix involves modifying the configuration merging process to ensure that any new configurations are merged with existing ones rather than replacing them outright. This preserves all necessary parameters and ensures stable and reliable model training operations.
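For illustration, here is a minimal sketch of the merge idea (my sketch, not the actual patch; the final priority scheme was refined in the review discussion below, and yaml_load / checks.check_yaml are the helpers quoted later from ultralytics/engine/model.py):

```python
# Sketch only: merge the user cfg over the existing overrides instead of
# replacing them, so keys like "model" and "task" survive when the cfg omits them.
cfg_dict = yaml_load(checks.check_yaml(kwargs["cfg"])) if kwargs.get("cfg") else {}
overrides = {**self.overrides, **cfg_dict}  # cfg wins on collisions, nothing is dropped
```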

Technical Details

My training code is as follows:

```python
new_config_path = create_new_config_file(args.config_dir, config)
if args.resume:
    model = YOLO(args.resume)
else:
    model = YOLO('models/pretrained/yolov8s-cls.pt')
model.train(
    data=args.dataset_dir,
    epochs=1000,
    # freeze=5,
    batch=256,
    device=[0, 1, 2, 3],
    # device=[1],
    imgsz=224,
    cache='disk',
    val=False,
    workers=5,
    cfg=new_config_path,
    name=args.timestamp,
)
```

In this code, I first load the model using model = YOLO('models/pretrained/yolov8s-cls.pt'). At this point, the model object is an instance of the class YOLO(Model), which inherits from Model in ultralytics.engine.model.

Then I explored ultralytics.engine.model, which of course has to be initialized. I found that initialization happens in ultralytics/engine/model.py with the following content:

```python
# Load or create new YOLO model
if Path(model).suffix in {".yaml", ".yml"}:
    self._new(model, task=task, verbose=verbose)
else:
    self._load(model, task=task)
```

Specifically, this line: self._load(model, task=task).

The _load method is defined in ultralytics/engine/model.py as follows (weights here is actually something like models/pretrained/yolov8s-cls.pt):

```python
def _load(self, weights: str, task=None) -> None:
    """
    Initializes a new model and infers the task type from the model head.

    Args:
        weights (str): model checkpoint to be loaded
        task (str | None): model task
    """
    if weights.lower().startswith(("https://", "http://", "rtsp://", "rtmp://", "tcp://")):
        weights = checks.check_file(weights)  # automatically download and return local filename
    weights = checks.check_model_file_from_stem(weights)  # add suffix, i.e. yolov8n -> yolov8n.pt

    if Path(weights).suffix == ".pt":
        print(Path(weights))
        self.model, self.ckpt = attempt_load_one_weight(weights)
        self.task = self.model.args["task"]
        self.overrides = self.model.args = self._reset_ckpt_args(self.model.args)
        self.ckpt_path = self.model.pt_path
    else:
        weights = checks.check_file(weights)  # runs in all cases, not redundant with above call
        self.model, self.ckpt = weights, None
        self.task = task or guess_model_task(weights)
        self.ckpt_path = weights
    self.overrides["model"] = weights
    self.overrides["task"] = self.task
    self.model_name = weights
```

As you can see, this process updates the overrides dictionary and includes the model parameter, which is actually the path models/pretrained/yolov8s-cls.pt from model = YOLO('models/pretrained/yolov8s-cls.pt'). However, following the original code in ultralytics/engine/model.py:

```python
overrides = yaml_load(checks.check_yaml(kwargs["cfg"])) if kwargs.get("cfg") else self.overrides
```
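To make the replacement concrete, here is a minimal self-contained sketch with hypothetical values:

```python
# Hypothetical values illustrating the replacement (not a merge):
self_overrides = {"model": "models/pretrained/yolov8s-cls.pt", "task": "classify"}
cfg_from_yaml = {"epochs": 100, "hsv_h": 0.015}  # a user cfg that omits "model"/"task"

overrides = cfg_from_yaml  # what the line above does when cfg is given
print(overrides.get("model"))  # -> None: the model path is gone
```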

Replacing the dictionary outright like this loses parameters, specifically the just-created self.overrides["model"] and self.overrides["task"]. This causes issues later when calling the trainer:

```python
custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task]}  # method defaults
args = {**overrides, **custom, **kwargs, "mode": "train"}  # highest priority args on the right
if args.get("resume"):
    args["resume"] = self.ckpt_path

self.trainer = (trainer or self._smart_load("trainer"))(overrides=args, _callbacks=self.callbacks)
```
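For reference, Python's dict unpacking gives the right-most source priority on collisions, and a key absent from every source simply stays absent. A small demo with hypothetical values:

```python
overrides = {"epochs": 100}  # "model" was already lost above
custom = {"data": "coco8.yaml", "epochs": 200}
kwargs = {"epochs": 300}

args = {**overrides, **custom, **kwargs, "mode": "train"}
print(args["epochs"])     # -> 300: kwargs outrank custom, which outranks overrides
print(args.get("model"))  # -> None: nothing re-supplied it
```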

This results in the loss of the model's name, ultimately causing str(self.model) to become the string "None" unexpectedly. Therefore, I have fixed this bug. The failure surfaces in ultralytics/models/yolo/classify/train.py:

```python
def setup_model(self):
    """Load, create or download model for any task."""
    import torchvision  # scope for faster 'import ultralytics'

    if isinstance(self.model, torch.nn.Module):  # if model is loaded beforehand. No setup needed
        return
    model, ckpt = str(self.model), None
    print(model, ckpt)
    # HERE WILL ALWAYS BE (None, None)!!!
    if model.endswith(".pt"):
        self.model, ckpt = attempt_load_one_weight(model, device="cpu")
        for p in self.model.parameters():
            p.requires_grad = True  # for training
    elif model.split(".")[-1] in {"yaml", "yml"}:
        self.model = self.get_model(cfg=model)
    elif model in torchvision.models.__dict__:
        self.model = torchvision.models.__dict__[model](weights="IMAGENET1K_V1" if self.args.pretrained else None)
    else:
        raise FileNotFoundError(f"ERROR: model={model} not found locally or online. Please check model name.")
    ClassificationModel.reshape_outputs(self.model, self.data["nc"])

    return ckpt
```
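To spell out why that final branch raises, here is a hedged trace of the string path when the trainer's args carried model=None (my annotation, not library code):

```python
# Hypothetical trace of setup_model() with self.model == None:
model = str(None)                    # -> the string "None"
print(model.endswith(".pt"))         # False: skips the .pt branch
print(model.split(".")[-1])          # "None": not in {"yaml", "yml"}
# "None" is not a torchvision model name either, so the last branch raises:
# FileNotFoundError: ERROR: model=None not found locally or online. ...
```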

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced configuration handling in training method.

📊 Key Changes

  • The training method now efficiently manages configurations sourced from multiple inputs (e.g., default settings, task specifications, and direct user inputs).
  • Introduction of more comprehensive handling of the data, model, and task configuration parameters, ensuring that user-defined settings (cfg) take precedence if provided.

🎯 Purpose & Impact

  • Purpose: To streamline the process of configuring model training, making it more intuitive and flexible for the users. These changes aim to minimize configuration errors and make the codebase more adaptable to different scenarios.
  • Impact: Users will experience a more seamless and error-resistant setup process for training their models. This update ensures that training configurations are more predictable and user-friendly, potentially leading to improved user satisfaction and better model performance outcomes. 🚀

@DseidLi changed the title from "fix overrides paras bug" to "Fix Overrides Paras Bug" on Apr 12, 2024

codecov bot commented Apr 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.19%. Comparing base (d608565) to head (674c67e).
Report is 1 commit behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main   #10002      +/-   ##
==========================================
- Coverage   75.99%   75.19%   -0.81%
==========================================
  Files         121      121
  Lines       15332    15332
==========================================
- Hits        11652    11529     -123
- Misses       3680     3803     +123
```

| Flag | Coverage Δ |
| --- | --- |
| Benchmarks | 35.78% <0.00%> (-0.24%) ⬇️ |
| GPU | 37.92% <100.00%> (-0.02%) ⬇️ |
| Tests | 70.63% <100.00%> (-0.67%) ⬇️ |

Flags with carried forward coverage won't be shown.


@glenn-jocher (Member) commented on Apr 13, 2024

@DseidLi thanks for the PR! Yes I understand your concern here. Will take a look. @Laughing-q what do you think here?

@DseidLi (Contributor, Author) commented on Apr 13, 2024

Thank you for reviewing my PR! Perhaps my previous explanation was too detailed. I will summarize the purpose of this PR in one sentence:
If the model is initially instantiated with model = YOLO('models/pretrained/yolov8s-cls.pt'), and model.train() is then called with cfg=config_path, where the config at config_path does not redundantly define the key model (as I assume most people would not re-specify 'model': 'models/pretrained/yolov8s-cls.pt' within the config), then multi-GPU training will fail with an error because model=None.

To help simplify understanding, I will give a specific example below to reproduce this bug:

```python
model = YOLO('models/pretrained/yolov8s-cls.pt')
config = {
    "task": args.task,
    "mode": args.mode,
    "amp": args.amp,
    "hsv_h": args.hsv_h,
    "hsv_s": args.hsv_s,
    "hsv_v": args.hsv_v,
    "degrees": args.degrees,
    "translate": args.translate,
    "scale": args.scale,
    "shear": args.shear,
    "perspective": args.perspective,
    "flipud": args.flipud,
    "fliplr": args.fliplr,
    "mosaic": args.mosaic,
    "mixup": args.mixup,
    "copy_paste": args.copy_paste,
}

new_config_path = create_new_config_file(args.config_dir, config)

model.train(
    data=args.dataset_dir,
    epochs=1000,
    batch=256,
    device=[0, 1, 2, 3],  # This ensures multi-GPU training, which is necessary to reproduce this bug
    imgsz=224,
    cache='disk',
    val=False,
    workers=5,
    cfg=new_config_path,
    name=args.timestamp,
)
```

As you can see, this uses multi-GPU training with a default DDP (Distributed Data Parallel) strategy. Initially, I instantiated the model object as follows:

```python
model = YOLO('models/pretrained/yolov8s-cls.pt')
```

Therefore, I am certain of my model configuration when passing cfg=new_config_path, and I wouldn't want to re-specify my model's parameters in the cfg again. That's reasonable, as they were already provided in model = YOLO('models/pretrained/yolov8s-cls.pt'). However, based on my testing, this configuration will definitely cause an error. The error message is similar to the following:

```
Traceback (most recent call last):
  File "/home/xxxxx/.config/Ultralytics/DDP/_temp_uo7v_kms22746647789328.py", line 12, in <module>
    results = trainer.train()
  File "/home/xxxx/miniconda3/envs/pytorch/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 198, in train
    self._do_train(world_size)
  File "/home/xxxxx/miniconda3/envs/pytorch/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 312, in _do_train
    self._setup_train(world_size)
  File "/home/xxxxx/miniconda3/envs/pytorch/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 226, in _setup_train
    ckpt = self.setup_model()
  File "/home/xxxx/miniconda3/envs/pytorch/lib/python3.10/site-packages/ultralytics/models/yolo/classify/train.py", line 81, in setup_model
    raise FileNotFoundError(f"ERROR: model={model} not found locally or online. Please check model name.")
FileNotFoundError: ERROR: model=None not found locally or online. Please check model name.
```

By the way, in my testing this bug only appears in multi-GPU training, not single-GPU. (Presumably this is because DDP reconstructs the trainer in a subprocess from the serialized args alone, as the _temp_*.py file in the traceback suggests, so the model object loaded in the parent process is not available there.) Single-GPU training behaves normally and does not need this PR.

@glenn-jocher (Member) commented:

Hey there! 👋 First off, huge thanks for making your PR explanation even clearer. You're definitely on point with the scenario you've described. It sounds like the crux of the issue is how the cfg gets merged with the existing override parameters—especially crucial when not redefining model in cfg leads to a hiccup in multi-GPU setups.

Your example does a great job highlighting the subtle yet impactful bug. 🐛 It's clear how this situation could trip up the model configuration under multi-GPU training scenarios.

We’ll take a closer look into this, considering the multi-GPU context you’ve pointed out. Your effort to bring this to attention and provide a detailed explanation is much appreciated! Let's work together to streamline this part of the training process. 🚀

@Laughing-q (Member) commented on Apr 14, 2024

@glenn-jocher @DseidLi Yeah, I think the model created from YOLO("model") should have high priority since it's explicitly defined in the code by the user, which means self.overrides["model"] and self.overrides["task"] should have higher priority than the ones from cfg, just as kwargs explicitly passed by the user have higher priority than overrides:

```python
args = {**overrides, **custom, **kwargs, "mode": "train"}  # highest priority args on the right
```

Meanwhile, I found another issue when putting data into cfg (which could also happen in real user cases): it turns out that data would be overwritten by custom, and training would start from coco8.yaml by default:

```python
custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task]}  # method defaults
args = {**overrides, **custom, **kwargs, "mode": "train"}  # highest priority args on the right
```

To reproduce the data overwrite issue:

```python
from ultralytics.utils import DEFAULT_CFG_PATH, yaml_save, yaml_load
from ultralytics import YOLO

cfg = yaml_load(DEFAULT_CFG_PATH)
cfg["data"] = "coco128.yaml"
yaml_save("test_cfg.yaml", cfg)
model = YOLO("yolov8n.pt")
model.train(cfg="./test_cfg.yaml")
```

I expected training to start with data=coco128.yaml, but it eventually launched training from coco8.yaml.

@Laughing-q (Member) commented on Apr 14, 2024

@glenn-jocher @DseidLi I made an update to handle the priority issue of model and task, and also fixed the data overwrite issue by falling back to coco8.yaml only when data is None. What do you guys think?
I tested locally and both cases seem resolved to me.
@DseidLi Please test it and see if it fixes the issue in your case. Thanks!

@DseidLi (Contributor, Author) commented on Apr 14, 2024

Hi @Laughing-q I've tested the updates and both issues are resolved effectively. Everything looks good to me. Great work on addressing these bugs!

@glenn-jocher (Member) commented on Apr 14, 2024

@DseidLi @Laughing-q thanks for the contributions guys, but this is not that simple. We can NOT change cfg handling for a single method: there are many model methods (predict, val, train, export, track, etc.) that all currently handle the various levels of arguments in exactly the same way, and this PR breaks that consistency:

```python
args = {**overrides, **custom, **kwargs, "mode": "train"}  # highest priority args on the right
```

@Laughing-q (Member) commented on Apr 14, 2024

@glenn-jocher yep, I saw that, but figured that the train method could be a bit different since it's the only method that considers the kwargs["cfg"] case:

```python
overrides = yaml_load(checks.check_yaml(kwargs["cfg"])) if kwargs.get("cfg") else self.overrides
```

EDIT: We can revert the custom part if you feel that's better, as the data overwrite issue is also an edge case, I think.

@glenn-jocher (Member) commented on Apr 14, 2024

@Laughing-q @DseidLi we have an existing argument called custom that serves the exact purpose you need, right? Why not use it instead of modifying the rest of the code? You can add any additional arguments here that you want to take priority over the overrides:

```python
custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task], ... anything else here}  # method defaults
```

The current code being:

```python
overrides = yaml_load(checks.check_yaml(kwargs["cfg"])) if kwargs.get("cfg") else self.overrides
custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task]}  # method defaults
args = {**overrides, **custom, **kwargs, "mode": "train"}  # highest priority args on the right
```

You can add any keys you want to the custom dict that you are concerned about, no?

```diff
- custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task]}  # method defaults
+ custom = {"data": DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task], "model": self.overrides["model"], "task": self.task}
```
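With that change, a cfg that omits model and task can no longer clobber them, because custom sits to the right of overrides in the merge. A quick sketch with hypothetical values:

```python
overrides = {"epochs": 100}  # loaded from a user cfg without "model"/"task"
custom = {"data": "imagenet", "model": "models/pretrained/yolov8s-cls.pt", "task": "classify"}
kwargs = {"batch": 256}

args = {**overrides, **custom, **kwargs, "mode": "train"}
print(args["model"])  # -> models/pretrained/yolov8s-cls.pt (preserved)
```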

@Laughing-q (Member) commented:

@glenn-jocher yeah, you are right! We should move the model and task keys into custom. Meanwhile, we should probably keep this line to fix the data overwrite issue:

overrides["data"] = overrides.get("data") or DEFAULT_CFG_DICT["data"] or TASK2DATA[self.task]

@Laughing-q (Member) commented on Apr 14, 2024

@glenn-jocher @DseidLi This 0637bfb should be better. :)

EDIT: I had just forgotten that self.overrides is still intact after being passed to the local variable overrides.

@glenn-jocher (Member) commented:

@Laughing-q hey that is a lot better!!

@glenn-jocher changed the title from "Fix Overrides Paras Bug" to "Fix overrides training cfg bug" on Apr 14, 2024
@glenn-jocher merged commit 3bbbdcf into ultralytics:main on Apr 14, 2024
10 checks passed
@glenn-jocher (Member) commented:

@Laughing-q @DseidLi PR merged, nice work guys!!

@DseidLi let us know if you spot any other problems :)

@DseidLi deleted the fix_overrides branch on April 14, 2024 at 15:41
hmurari pushed a commit to hmurari/ultralytics that referenced this pull request Apr 17, 2024
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: Ultralytics AI Assistant <135830346+UltralyticsAssistant@users.noreply.github.com>
Co-authored-by: Laughing-q <1185102784@qq.com>