
Do not perform reverse update weights. #13364

Open
1 task done
cainiao123s opened this issue Jun 4, 2024 · 5 comments
Labels
question (Further information is requested) · Stale

Comments

@cainiao123s

Search before asking

Question

Due to certain special requirements, I do not need to update the model parameters, so I modified the training part of the code accordingly. However, I ran into a new problem: even though the parameters are never updated, the loss computed in each epoch is different. Running the script multiple times reproduces exactly the same numbers each time:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.6322, 4.5460, 1.3188]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.6720, 5.1917, 1.3830])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3070, 4.8387, 1.1524]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

Here is the training code:
from ultralytics.models.yolo.detect import DetectionTrainer

if __name__ == '__main__':
    args = dict(model='yolov8s.pt', data='coco8.yaml', epochs=3, amp=False, batch=1, device='cpu', translate=0.0, scale=0.0)
    trainer = DetectionTrainer(overrides=args)
    trainer._do_train()


def _do_train(self, world_size=1):
    # Fix all RNG seeds so repeated runs are directly comparable
    import random
    random.seed(0)
    torch.manual_seed(0)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(0)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    self._setup_train(world_size)

    nb = len(self.train_loader)  # number of batches
    nw = max(round(self.args.warmup_epochs * nb), 100) if self.args.warmup_epochs > 0 else -1  # warmup iterations
    LOGGER.info(f'Image sizes {self.args.imgsz} train, {self.args.imgsz} val\n'
                f'Using {self.train_loader.num_workers * (world_size or 1)} dataloader workers\n'
                f"Logging results to {colorstr('bold', self.save_dir)}\n"
                f'Starting training for {self.epochs} epochs...')
    if self.args.close_mosaic:
        base_idx = (self.epochs - self.args.close_mosaic) * nb
        self.plot_idx.extend([base_idx, base_idx + 1, base_idx + 2])
    epoch = self.epochs  # predefine for resume fully trained model edge cases
    for epoch in range(self.start_epoch, self.epochs):
        self.epoch = epoch
        self.model.train()
        pbar = enumerate(self.train_loader)

        if RANK in (-1, 0):
            LOGGER.info(self.progress_string())
            pbar = TQDM(enumerate(self.train_loader), total=nb)
        self.tloss = None
        # No backward pass or optimizer step: only the forward pass and loss are computed
        with torch.no_grad():
            for i, batch in pbar:
                # Forward
                with torch.cuda.amp.autocast(self.amp):
                    batch = self.preprocess_batch(batch)
                    self.loss, self.loss_items = self.model(batch)
                    print('loss_items:', self.loss_items)

Additional

No response

@cainiao123s added the question label on Jun 4, 2024
@glenn-jocher
Member

@cainiao123s hello,

Thank you for providing detailed information about your issue. It seems you are encountering variability in loss values across epochs even though the model parameters are not being updated. This can happen for several reasons, even with a fixed seed:

  1. Data Loading: If there's any randomness in the way data is loaded or augmented during training, this could lead to variations in the loss. Since you've disabled parameter updates, the model itself should not be changing, but the input data might be.

  2. Floating Point Precision: Operations on CPUs and GPUs can have slight differences in floating point precision, which might cause minor variations in computed loss.

  3. Environment Factors: Other factors such as underlying library updates, OS-level updates, or hardware-specific optimizations might also influence results.

To further investigate, you might want to ensure that all data loading and augmentation steps are deterministic. Also, check if any external libraries or functions you use in the data preprocessing or loss computation might introduce randomness.

If the issue persists, providing more details about the data handling and any preprocessing steps might help in diagnosing the problem more effectively.
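
For reference, the standard PyTorch recipe for reproducible data loading looks roughly like the minimal sketch below (independent of the Ultralytics internals; the seed_worker helper, the toy dataset, and the loader settings are illustrative):

import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    """Re-seed Python and NumPy RNGs inside every DataLoader worker process."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


if __name__ == '__main__':
    g = torch.Generator()
    g.manual_seed(0)  # fixed seed -> the shuffle order is reproducible across runs

    dataset = TensorDataset(torch.arange(8).float())  # toy stand-in for an image dataset
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2,
                        worker_init_fn=seed_worker, generator=g)

    for epoch in range(2):
        # The batch order is identical from run to run, but still differs between epochs,
        # because the generator keeps advancing to draw each new permutation.
        print('epoch', epoch, [batch[0].tolist() for batch in loader])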

@cainiao123s
Author

After a lot of effort, I found that the data is very likely re-processed for every epoch. Is that what actually happens?
To test this, I modified the class of one training label file in the coco8.yaml dataset, and the new results are as follows:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.5801, 2.4663, 1.3199]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.5253, 1.9911, 1.2469])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3981, 2.1305, 1.1685]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

Compared with the earlier results, the first value in the epoch-1 column, the third value in the epoch-2 column, and the first value in the epoch-3 column have changed. This suggests that some operation is applied to the dataset in every epoch, i.e. the data loading is not identical from epoch to epoch.

@glenn-jocher
Member

Hello @cainiao123s,

Thank you for sharing your observations. Yes, in YOLOv8, data processing can indeed vary for each epoch. This variation is typically due to the data augmentation techniques applied during training to enhance the model's ability to generalize. These techniques might include random transformations like scaling, cropping, flipping, or color adjustments, which can lead to different loss values even for the same image across different epochs.

The changes you're observing in the loss items across epochs are expected behavior when such data augmentations are active. If you need consistent data input for each epoch to test specific behaviors or modifications, you might consider disabling these augmentations temporarily.
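
For example, a minimal sketch of this overrides-based approach (the hyperparameter names below follow the standard Ultralytics defaults; please double-check them against the cfg/default.yaml of your installed version):

from ultralytics.models.yolo.detect import DetectionTrainer

if __name__ == '__main__':
    args = dict(
        model='yolov8s.pt', data='coco8.yaml', epochs=3, amp=False, batch=1, device='cpu',
        # geometric augmentations off
        degrees=0.0, translate=0.0, scale=0.0, shear=0.0, perspective=0.0,
        flipud=0.0, fliplr=0.0,
        # colour augmentations off
        hsv_h=0.0, hsv_s=0.0, hsv_v=0.0,
        # multi-image augmentations off
        mosaic=0.0, mixup=0.0, copy_paste=0.0,
    )
    trainer = DetectionTrainer(overrides=args)
    trainer.train()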

Let us know if you have further questions or need more details on how to proceed! 😊

@cainiao123s
Author

I also ran some ablation experiments. I found the shuffle-related code and, judging from the data, shuffle does change the data loading order each epoch. However, I found another issue: on the same dataset, the loss values differ depending on whether shuffle is used or not (here shuffle = mode == 'train' is equivalent to shuffle = True). Can you explain this? I would like to investigate further.
The relevant code section is:
def get_dataloader(self, dataset_path, batch_size=16, rank=0, mode='train'):
    """Construct and return dataloader."""
    assert mode in ['train', 'val']
    with torch_distributed_zero_first(rank):  # init dataset *.cache only once if DDP
        dataset = self.build_dataset(dataset_path, mode, batch_size)
    # One of the following shuffle settings was enabled per experiment:
    # shuffle = mode == 'train'
    # shuffle = False
    # shuffle = True
    if getattr(dataset, 'rect', False) and shuffle:
        LOGGER.warning("WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False")
        shuffle = False
    workers = self.args.workers if mode == 'train' else self.args.workers * 2
    return build_dataloader(dataset, batch_size, workers, shuffle, rank)  # return dataloader

File path: ultralytics/models/yolo/detect/train.py, method get_dataloader

The following data were obtained from the ablation experiments (columns are epochs):


shuffle = mode == 'val' (i.e. shuffle=False), original dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([1.8969, 5.3689, 2.1035]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([1.6455, 4.6298, 1.8679])
loss_items: tensor([2.1795, 3.3320, 2.0722]) | loss_items: tensor([1.5175, 4.0790, 1.6325]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.9762, 4.7864, 1.2525]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([0.5511, 4.3917, 1.2769]) | loss_items: tensor([1.0438, 5.8705, 1.6147]) | loss_items: tensor([0.7897, 5.4591, 1.3969])

shuffle = mode == 'train' (i.e. shuffle=True), original dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.6322, 4.5460, 1.3188]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.6720, 5.1917, 1.3830])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3070, 4.8387, 1.1524]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])

shuffle = mode == 'val' (i.e. shuffle=False), changed dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([1.8969, 5.3689, 2.1035]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([1.6455, 4.6298, 1.8679])
loss_items: tensor([2.1795, 3.3320, 2.0722]) | loss_items: tensor([1.5175, 4.0790, 1.6325]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.9762, 4.7864, 1.2525]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([0.5414, 2.3891, 1.3085]) | loss_items: tensor([0.5728, 2.3244, 1.3329]) | loss_items: tensor([0.4131, 2.2960, 1.2324])

shuffle = mode == 'train' (i.e. shuffle=True), changed dataset:

epoch1 | epoch2 | epoch3
loss_items: tensor([0.5801, 2.4663, 1.3199]) | loss_items: tensor([2.1151, 5.2178, 2.1823]) | loss_items: tensor([0.5253, 1.9911, 1.2469])
loss_items: tensor([1.8331, 4.7669, 2.0235]) | loss_items: tensor([0.7512, 4.4382, 0.9191]) | loss_items: tensor([2.5066, 3.5849, 2.0813])
loss_items: tensor([0.9504, 4.8735, 1.2347]) | loss_items: tensor([0.3981, 2.1305, 1.1685]) | loss_items: tensor([0.7035, 5.8424, 0.9036])
loss_items: tensor([2.0359, 4.1217, 1.7355]) | loss_items: tensor([2.2144, 3.5352, 2.1413]) | loss_items: tensor([1.7366, 4.7452, 1.9406])
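
A guess at one possible mechanism (unconfirmed, outside the Ultralytics code): if augmentation parameters are drawn from a single shared RNG stream, then changing the iteration order via shuffle changes which random numbers each image consumes, so per-image losses can differ even with a fixed seed. A toy sketch of that effect in plain Python:

import random

def fake_augment(image_id):
    # Stand-in for an augmentation: draws one value from the shared RNG stream.
    return random.random()

random.seed(0)
no_shuffle = {i: fake_augment(i) for i in [0, 1, 2, 3]}  # fixed order

random.seed(0)
shuffled = {i: fake_augment(i) for i in [2, 0, 3, 1]}  # same seed, different order

# Image 0 consumes a different random number in the shuffled run,
# so it would receive a different augmentation and hence a different loss.
print(no_shuffle[0], shuffled[0])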


github-actions bot commented Jul 8, 2024

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale label Jul 8, 2024