
Do not perform reverse update weights. #13364

Open
1 task done
cainiao123s opened this issue Jun 4, 2024 · 5 comments
Labels
question (Further information is requested) · Stale

Comments

@cainiao123s

Search before asking

Question

Due to certain special requirements, I do not need to update the model parameters, so I modified the training part of the code accordingly. However, I ran into a new problem: even though the parameters are never updated, the loss computed in each epoch is different. Running the script multiple times reproduces exactly the same numbers each time:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.6322, 4.5460, 1.3188]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.6720, 5.1917, 1.3830])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3070, 4.8387, 1.1524]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

Here is the training code:
from ultralytics.models.yolo.detect import DetectionTrainer

if __name__ == '__main__':
    args = dict(model='yolov8s.pt', data='coco8.yaml', epochs=3, amp=False, batch=1, device='cpu', translate=0.0, scale=0.0)
    trainer = DetectionTrainer(overrides=args)
    trainer._do_train()


def _do_train(self, world_size=1):
    # Fix all RNG seeds so repeated runs are directly comparable
    import random
    random.seed(0)
    torch.manual_seed(0)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(0)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    self._setup_train(world_size)

    nb = len(self.train_loader)  # number of batches
    nw = max(round(self.args.warmup_epochs * nb), 100) if self.args.warmup_epochs > 0 else -1  # warmup iterations
    LOGGER.info(f'Image sizes {self.args.imgsz} train, {self.args.imgsz} val\n'
                f'Using {self.train_loader.num_workers * (world_size or 1)} dataloader workers\n'
                f"Logging results to {colorstr('bold', self.save_dir)}\n"
                f'Starting training for {self.epochs} epochs...')
    if self.args.close_mosaic:
        base_idx = (self.epochs - self.args.close_mosaic) * nb
        self.plot_idx.extend([base_idx, base_idx + 1, base_idx + 2])
    epoch = self.epochs  # predefine for resume fully trained model edge cases
    for epoch in range(self.start_epoch, self.epochs):
        self.epoch = epoch
        self.model.train()
        pbar = enumerate(self.train_loader)

        if RANK in (-1, 0):
            LOGGER.info(self.progress_string())
            pbar = TQDM(enumerate(self.train_loader), total=nb)
        self.tloss = None
        # No backward pass or optimizer step: only the forward pass and loss are computed
        with torch.no_grad():
            for i, batch in pbar:
                # Forward
                with torch.cuda.amp.autocast(self.amp):
                    batch = self.preprocess_batch(batch)
                    self.loss, self.loss_items = self.model(batch)
                    print('loss_items:', self.loss_items)

Additional

No response

@cainiao123s added the question label on Jun 4, 2024
@glenn-jocher
Member

@cainiao123s hello,

Thank you for providing detailed information about your issue. It seems you are encountering variability in loss values across epochs even though the model parameters are not being updated. This can happen for several reasons, even with a fixed seed:

  1. Data Loading: If there's any randomness in the way data is loaded or augmented during training, this could lead to variations in the loss. Since you've disabled parameter updates, the model itself should not be changing, but the input data might be.

  2. Floating Point Precision: Operations on CPUs and GPUs can have slight differences in floating point precision, which might cause minor variations in computed loss.

  3. Environment Factors: Other factors such as underlying library updates, OS-level updates, or hardware-specific optimizations might also influence results.

To further investigate, you might want to ensure that all data loading and augmentation steps are deterministic. Also, check if any external libraries or functions you use in the data preprocessing or loss computation might introduce randomness.

If the issue persists, providing more details about the data handling and any preprocessing steps might help in diagnosing the problem more effectively.
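
For reference, the standard PyTorch recipe for reproducible data loading looks roughly like the minimal sketch below (independent of the Ultralytics internals; the seed_worker helper, the toy dataset, and the loader settings are illustrative):

import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    """Re-seed Python and NumPy RNGs inside every DataLoader worker process."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


if __name__ == '__main__':
    g = torch.Generator()
    g.manual_seed(0)  # fixed seed -> the shuffle order is reproducible across runs

    dataset = TensorDataset(torch.arange(8).float())  # toy stand-in for an image dataset
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2,
                        worker_init_fn=seed_worker, generator=g)

    for epoch in range(2):
        # The batch order is identical from run to run, but still differs between epochs,
        # because the generator keeps advancing to draw each new permutation.
        print('epoch', epoch, [batch[0].tolist() for batch in loader])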

@cainiao123s
Author

After a lot of effort, I found that the data is very likely re-processed for every epoch. Is that what actually happens?
To test this, I modified the class of one training label file in the coco8.yaml dataset, and the new results are as follows:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.5801, 2.4663, 1.3199]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.5253, 1.9911, 1.2469])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3981, 2.1305, 1.1685]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

Compared with the earlier results, the first value in the epoch-1 column, the third value in the epoch-2 column, and the first value in the epoch-3 column have changed. This suggests that some operation is applied to the dataset in every epoch, i.e. the data loading is not identical from epoch to epoch.

@glenn-jocher
Member

Hello @cainiao123s,

Thank you for sharing your observations. Yes, in YOLOv8, data processing can indeed vary for each epoch. This variation is typically due to the data augmentation techniques applied during training to enhance the model's ability to generalize. These techniques might include random transformations like scaling, cropping, flipping, or color adjustments, which can lead to different loss values even for the same image across different epochs.

The changes you're observing in the loss items across epochs are expected behavior when such data augmentations are active. If you need consistent data input for each epoch to test specific behaviors or modifications, you might consider disabling these augmentations temporarily.
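
For example, a minimal sketch of this overrides-based approach (the hyperparameter names below follow the standard Ultralytics defaults; please double-check them against the cfg/default.yaml of your installed version):

from ultralytics.models.yolo.detect import DetectionTrainer

if __name__ == '__main__':
    args = dict(
        model='yolov8s.pt', data='coco8.yaml', epochs=3, amp=False, batch=1, device='cpu',
        # geometric augmentations off
        degrees=0.0, translate=0.0, scale=0.0, shear=0.0, perspective=0.0,
        flipud=0.0, fliplr=0.0,
        # colour augmentations off
        hsv_h=0.0, hsv_s=0.0, hsv_v=0.0,
        # multi-image augmentations off
        mosaic=0.0, mixup=0.0, copy_paste=0.0,
    )
    trainer = DetectionTrainer(overrides=args)
    trainer.train()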

Let us know if you have further questions or need more details on how to proceed! 😊

@cainiao123s
Author

I also ran some ablation experiments. I found the shuffle-related code and, judging from the data, shuffle does change the data loading order each epoch. However, I found another issue: on the same dataset, the loss values differ depending on whether shuffle is used or not (here shuffle = mode == 'train' is equivalent to shuffle = True). Can you explain this? I would like to investigate further.
The relevant code section is:
def get_dataloader(self, dataset_path, batch_size=16, rank=0, mode='train'):
    """Construct and return dataloader."""
    assert mode in ['train', 'val']
    with torch_distributed_zero_first(rank):  # init dataset *.cache only once if DDP
        dataset = self.build_dataset(dataset_path, mode, batch_size)
    # One of the following shuffle settings was enabled per experiment:
    # shuffle = mode == 'train'
    # shuffle = False
    # shuffle = True
    if getattr(dataset, 'rect', False) and shuffle:
        LOGGER.warning("WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False")
        shuffle = False
    workers = self.args.workers if mode == 'train' else self.args.workers * 2
    return build_dataloader(dataset, batch_size, workers, shuffle, rank)  # return dataloader

File path: ultralytics/models/yolo/detect/train.py, method get_dataloader

The following data were obtained from the ablation experiments (columns are epochs):


shuffle = mode == 'val' (i.e. shuffle=False), original dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([1.8969, 5.3689, 2.1035]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([1.6455, 4.6298, 1.8679])
loss_items: tensor([2.1795, 3.3320, 2.0722]) | loss_items: tensor([1.5175, 4.0790, 1.6325]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.9762, 4.7864, 1.2525]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([0.5511, 4.3917, 1.2769]) | loss_items: tensor([1.0438, 5.8705, 1.6147]) | loss_items: tensor([0.7897, 5.4591, 1.3969])

shuffle = mode == 'train' (i.e. shuffle=True), original dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.6322, 4.5460, 1.3188]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.6720, 5.1917, 1.3830])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3070, 4.8387, 1.1524]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

shuffle = mode == 'val' (i.e. shuffle=False), changed dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([1.8969, 5.3689, 2.1035]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([1.6455, 4.6298, 1.8679])
loss_items: tensor([2.1795, 3.3320, 2.0722]) | loss_items: tensor([1.5175, 4.0790, 1.6325]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.9762, 4.7864, 1.2525]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([0.5414, 2.3891, 1.3085]) | loss_items: tensor([0.5728, 2.3244, 1.3329]) | loss_items: tensor([0.4131, 2.2960, 1.2324])

shuffle = mode == 'train' (i.e. shuffle=True), changed dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.5801, 2.4663, 1.3199]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.5253, 1.9911, 1.2469])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3981, 2.1305, 1.1685]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])
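
A guess at one possible mechanism (unconfirmed, outside the Ultralytics code): if augmentation parameters are drawn from a single shared RNG stream, then changing the iteration order via shuffle changes which random numbers each image consumes, so per-image losses can differ even with a fixed seed. A toy sketch of that effect in plain Python:

import random

def fake_augment(image_id):
    # Stand-in for an augmentation: draws one value from the shared RNG stream.
    return random.random()

random.seed(0)
no_shuffle = {i: fake_augment(i) for i in [0, 1, 2, 3]}  # fixed order

random.seed(0)
shuffled = {i: fake_augment(i) for i in [2, 0, 3, 1]}  # same seed, different order

# Image 0 consumes a different random number in the shuffled run,
# so it would receive a different augmentation and hence a different loss.
print(no_shuffle[0], shuffled[0])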


github-actions bot commented Jul 8, 2024

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale label Jul 8, 2024