
Allow variable number of repetitions for RA #5084

Merged: 2 commits into pytorch:main on Dec 20, 2021
Conversation

@tbennun (Contributor) commented on Dec 10, 2021:

The number of repetitions in the RA sampler was hard-coded to 3. With this PR, it can be changed through CLI flags.

Related to #5051.

cc @datumbox

@facebook-github-bot commented on Dec 10, 2021:

💊 CI failures summary and remediations

As of commit 9a89f9d (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR
  • 1/2 broken upstream at merge base 1efb567 on Dec 19 from 5:50am to 5:54am

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build CodeQL / build (1/1)

Step: "Build TorchVision"

2021-12-20T08:58:36.3990086Z     self.finalize_options()
2021-12-20T08:58:36.3991080Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/develop.py", line 52, in finalize_options
2021-12-20T08:58:36.3991980Z     easy_install.finalize_options(self)
2021-12-20T08:58:36.3993030Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 263, in finalize_options
2021-12-20T08:58:36.3993834Z     self._fix_install_dir_for_user_site()
2021-12-20T08:58:36.3994885Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 375, in _fix_install_dir_for_user_site
2021-12-20T08:58:36.3995733Z     self.select_scheme(scheme_name)
2021-12-20T08:58:36.3996760Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 716, in select_scheme
2021-12-20T08:58:36.3997539Z     scheme = INSTALL_SCHEMES[name]
2021-12-20T08:58:36.3998118Z KeyError: 'unix_user'
2021-12-20T08:58:36.5460093Z ##[error]Process completed with exit code 1.
2021-12-20T08:58:36.5558109Z Post job cleanup.
2021-12-20T08:58:36.6718960Z [command]/usr/bin/git version
2021-12-20T08:58:36.6773069Z git version 2.34.1
2021-12-20T08:58:36.6805717Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2021-12-20T08:58:36.6852070Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2021-12-20T08:58:36.7200990Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2021-12-20T08:58:36.7242490Z http.https://github.com/.extraheader
2021-12-20T08:58:36.7252906Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2021-12-20T08:58:36.7295262Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :
2021-12-20T08:58:36.7721913Z Cleaning up orphan processes

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

@datumbox (Contributor) left a review comment:

Thanks @tbennun. Overall it looks like a reasonable addition. I've added a question below; let me know your thoughts.

-parser.add_argument("--ra-sampler", action="store_true", help="whether to use ra_sampler in training")
+parser.add_argument("--ra-sampler", action="store_true", help="whether to use Repeated Augmentation in training")
+parser.add_argument(
+    "--ra-reps", default=3, type=int, help="number of repetitions for Repeated Augmentation (default: 3)"
+)
@datumbox (Contributor) commented on the diff:

Thoughts on whether we should have both a boolean and an integer param vs. combining the two into a single one? Other parameters (such as --clip-grad-norm and --auto-augment) take the latter approach, defining default=None or default=0 to turn the feature off by default. Though having an extra boolean option is useful when a feature has many config parameters, it also creates noise when reading the training log to determine which inputs were actually active for the training.
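For reference, the single-flag alternative discussed here could look like this in argparse (a sketch only; the merged PR kept two flags, and the convention of 0 meaning "disabled" is illustrative, not from the codebase):

```python
import argparse

parser = argparse.ArgumentParser()
# One integer flag: 0 (the default) disables Repeated Augmentation,
# while any positive value both enables it and sets the repetition count.
parser.add_argument(
    "--ra-reps",
    default=0,
    type=int,
    help="repetitions for Repeated Augmentation (0 disables the feature)",
)

args = parser.parse_args(["--ra-reps", "4"])
use_ra = args.ra_reps > 0  # the boolean flag becomes a derived value
```

The training log then shows a single entry, ra_reps, from which both "is the feature on?" and "how many repetitions?" can be read.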

@tbennun (Contributor, Author) replied:

I completely agree, though I followed the style of the EMA flag(s). Should I switch to a more general --repeated-augmentations with default 0?

@datumbox (Contributor) replied:

Thanks for the input. :) When we introduced EMA, we offered 2 parameters, --model-ema-steps and --model-ema-decay, which is why we didn't follow this approach there.

@sallysyw Any thoughts on which approach we should choose here?

@tbennun (Contributor, Author) replied:

Makes sense. Two things to consider here:

  1. Users may not be aware of what a "good value" for the number of repetitions is. This is solvable, though: an example could be given in the argument's help text.
  2. In my fork, I'm adding another boolean flag to keep the same epoch length (rather than cutting it down by the number of repetitions) and replicate the samples within the same minibatch (and GPU), following the original Batch Augmentation paper. If it improves generalization more than the DeiT approach, would you be interested in this as part of another PR?
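The in-batch replication described in point 2 can be sketched at the index level in plain Python (the function name and layout are illustrative, not code from the fork):

```python
def batch_with_repetitions(indices, batch_size, reps):
    """Group sample indices into minibatches in which every index appears
    `reps` times consecutively; each copy would later receive an
    independent random augmentation, as in the Batch Augmentation paper."""
    assert batch_size % reps == 0, "batch size must be divisible by reps"
    unique_per_batch = batch_size // reps
    batches = []
    for start in range(0, len(indices), unique_per_batch):
        chunk = indices[start : start + unique_per_batch]
        # Repeat each index `reps` times within the same minibatch.
        batches.append([i for i in chunk for _ in range(reps)])
    return batches

# 8 unique samples, batch size 8, 4 repetitions -> 4 batches of 2 unique samples
batches = batch_with_repetitions(list(range(8)), batch_size=8, reps=4)
```

With this scheme an epoch still visits every sample once, so the epoch length is preserved, while each minibatch contains only batch_size / reps distinct images.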

@datumbox (Contributor) replied:

Concerning point 1, I agree with your arguments and I don't have strong opinions. I think your PR is mergeable as-is, but I'll let @sallysyw, who contributed the class, decide whether we should go with 2 parameters or 1.

Concerning point 2, unless the gap in accuracy is massive, I would keep things simple and leave it as is.

A contributor replied:

Sorry I missed the discussion before. I think for this specific use case, having 2 parameters for RA is better than 1 because, as @tbennun mentioned, users might not have a good sense of which ra_reps to use. So it's important to have a non-zero default (also for the backward compatibility of replicating ViT training).

@tbennun (Contributor, Author) commented on Dec 20, 2021:

@sallysyw @datumbox any news about this PR? I can confirm that with this PR (ra_reps=4) and the latest ResNet-50/ImageNet recipe, there is an improvement in top-1 accuracy to 80.802%.

@datumbox (Contributor) commented:
@tbennun Sounds good, let's merge it. Do you have the log file and checkpoint for the ResNet50 run?

@datumbox datumbox merged commit 66d57fe into pytorch:main Dec 20, 2021
@github-actions commented:
Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@tbennun (Contributor, Author) commented on Dec 20, 2021:

> @tbennun Sounds good, let's merge it. Do you have the log file and checkpoint for the ResNet50 run?

@datumbox

Checkpoints:

The script didn't store a log file, so I only have the last few printouts (would be nice if it did by default :) ). However, it did store all the models from every epoch. I attached above the relevant two (last, best), and here is the log of running with --test-only on the last 50 models: test.log

The command line I used for training is as follows (unlike the blog post, I trained on four GPUs rather than eight, but with the same overall batch size):

torchrun --nproc_per_node=4 train.py --model resnet50 --batch-size 256 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 --weight-decay 0.00002 --norm-weight-decay 0.0 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --train-crop-size 176 --model-ema --val-resize-size 232 --data-path /path/to/imagenet --cache-dataset --ra-sampler --ra-reps=4

I am now testing the mode where a full epoch is used on the same system (training will see the same number of samples as the original, but with 4x augmentations).

@datumbox (Contributor) commented:
@tbennun Thanks for sharing the checkpoints and the info. This is great!

It's a bummer you don't have the log. We aim for full reproducibility, and we normally store the logs and all checkpoints for every model we release. I'm going to kick off a new training run with --ra-sampler --ra-reps=4 on our side to see if we can reproduce it, and if we do, I'll be happy to update the checkpoint. We'll work out a way on the PR to credit you as the author (we're still trying to figure out the best way to receive such contributions from the community, so bear with me).

facebook-github-bot pushed a commit that referenced this pull request Dec 21, 2021
Summary: Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>

Reviewed By: prabhat00155

Differential Revision: D33253472

fbshipit-source-id: d193892d8cfcd8b7dbc6cc04603eb19f65c9c3e3
@datumbox (Contributor) commented on Jan 1, 2022:

@tbennun Happy new year! I was able to verify your findings and get a run which achieves:

Acc@1 80.858 Acc@5 95.434

Would you be interested in sending a PR so that you can be credited for the contribution? Since you don't have access to deploy the weights, I can do that for you. I propose the following:

  1. Write a comment on the issue "Improve the accuracy of Classification models by using SOTA recipes and primitives" #3995, briefly describing the modification on top of the main recipe. Make it self-sufficient, so that someone who reads it understands what you did to achieve the improvement. Feel free to include papers/references that inspired the change.
  2. Send a new PR which modifies the accuracies of the V2 weights to the values I mention above. Moreover, change the recipe link to point to your previous comment.
  3. Make sure the "Allow edits from maintainers" checkbox on the PR is checked, so that I can update the URL of the weights once I deploy them on S3.

@tbennun (Contributor, Author) commented on Jan 3, 2022:

@datumbox Happy new year! That's great! I'll take care of the comment and the PR. I'm currently running the other version I mentioned before (saving all logs this time :) ). Is it OK if we hold off on merging the new PR until training completes? That way we can upload a modified version if the accuracy improvement is substantial.

@datumbox (Contributor) commented on Jan 4, 2022:

@tbennun Yes, no problem at all. If the accuracy diff is substantial, we can deploy yours. Though in that case, you would need to find a way to send us all the checkpoints and log files associated with the model. This is several hundred GB, so it might be a bit tricky, but I'm sure we will work it out.

@datumbox (Contributor) commented:
@tbennun Just checking to see if you have any update concerning the training. Thanks!

@tbennun (Contributor, Author) commented on Jan 11, 2022:

> @tbennun Just checking to see if you have any update concerning the training. Thanks!

@datumbox Yes. It's almost complete (epoch 552 right now) and the max top-1 accuracy is 81.05% (at epoch 513, top-5 95.52%), but it's not getting any higher at this point. This is the exact same recipe but with full BA. I kept the log and all checkpoints as well, but you can judge whether this is something worth uploading. What do you think?

@datumbox (Contributor) commented:
@tbennun Sounds good, let's review the full details once the training is completed. Please upload somewhere the best epoch checkpoint once the training is completed along with the log so that we can check. If we do end up using your run, it's going to be a challenge fetching all the logs and checkpoints. Bear with us as we try to work these details out; you are the first person who wants to contribute a half TB run, so that comes with its own challenges. 😄

@tbennun (Contributor, Author) commented on Jan 13, 2022:

@datumbox Training complete.

Some notes:

  • Some epochs appear more than once in the log. This is because of time limits on the cluster; the run was restarted each time with --resume (20 times). Since the entire optimizer state is serialized, I think this is OK(?)
  • I have all the weights, and I think we can find a solution to transfer them (temporarily placing them on a server)
  • One issue with the code: to enable the full BA run, I exploited the fact that the number of repetitions matches the number of nodes and simply fixed the seed across the nodes. Now that I know the gains could be interesting, I can work on a proper, more elaborate PR and a flag separately. The current diff is as follows:
diff --git a/references/classification/sampler.py b/references/classification/sampler.py
index 3c5e8b01..9ae16500 100644
--- a/references/classification/sampler.py
+++ b/references/classification/sampler.py
@@ -15,7 +15,7 @@ class RASampler(torch.utils.data.Sampler):
     https://github.com/facebookresearch/deit/blob/main/samplers.py
     """

-    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3):
+    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3, whole_dataset=True):
         if num_replicas is None:
             if not dist.is_available():
                 raise RuntimeError("Requires distributed package to be available!")
@@ -34,6 +34,7 @@ class RASampler(torch.utils.data.Sampler):
         self.shuffle = shuffle
         self.seed = seed
         self.repetitions = repetitions
+        self.alldata = whole_dataset

     def __iter__(self):
         # Deterministically shuffle based on epoch
@@ -44,6 +45,9 @@ class RASampler(torch.utils.data.Sampler):
         else:
             indices = list(range(len(self.dataset)))

+        if self.alldata and self.num_replicas == self.repetitions:
+            return iter(indices)
+
         # Add extra samples to make it evenly divisible
         indices = [ele for ele in indices for i in range(self.repetitions)]
         indices += indices[: (self.total_size - len(indices))]
@@ -56,7 +60,7 @@ class RASampler(torch.utils.data.Sampler):
         return iter(indices[: self.num_selected_samples])

     def __len__(self):
-        return self.num_selected_samples
+        return len(self.dataset) if self.alldata else self.num_selected_samples

     def set_epoch(self, epoch):
         self.epoch = epoch
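To make the patch's effect concrete, here is a torch-free sketch of the two index-selection paths (shuffling and RASampler's exact truncation constant are omitted, so the numbers are only approximate and the function name is illustrative):

```python
import math

def ra_indices(n, num_replicas, rank, repetitions, whole_dataset=False):
    indices = list(range(n))
    if whole_dataset and num_replicas == repetitions:
        # Patched path: every replica iterates over the full dataset,
        # so an epoch covers len(dataset) samples per GPU.
        return indices
    # Original path: repeat each index, pad so the list divides evenly
    # across replicas, shard, then keep roughly n / num_replicas samples,
    # of which only ~n / repetitions are distinct.
    repeated = [i for i in indices for _ in range(repetitions)]
    total = math.ceil(len(repeated) / num_replicas) * num_replicas
    repeated += repeated[: total - len(repeated)]
    shard = repeated[rank:total:num_replicas]
    return shard[: n // num_replicas]

full = ra_indices(8, num_replicas=4, rank=0, repetitions=4, whole_dataset=True)
part = ra_indices(8, num_replicas=4, rank=0, repetitions=4)
```

The guard num_replicas == repetitions mirrors the assumption in the diff: with a fixed seed across nodes, each node holds one of the repetitions of the whole dataset.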

@datumbox (Contributor) commented:
@tbennun Thanks for the update.

I assume that by nodes you mean the number of available GPUs. Judging from the patch you provided, I understand that each GPU actually runs over the full dataset, meaning that for ra_reps=4 it takes 4x longer to train the model. This is why, according to your log, the total number of iterations per epoch is 5005 instead of the expected 1252.
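The iteration counts quoted here follow from the run's configuration (4 GPUs, per-GPU batch size 256) and ImageNet-1k's 1,281,167 training images; rounding details of the real sampler could shift the numbers by an iteration, so this is just the back-of-the-envelope arithmetic:

```python
import math

n_images = 1_281_167   # ImageNet-1k training set size
per_gpu_batch = 256
num_gpus = 4

# Patched "whole dataset" mode: each GPU iterates over all images per epoch.
iters_whole = math.ceil(n_images / per_gpu_batch)

# Standard distributed sharding: each GPU sees ~1/num_gpus of the images per
# epoch, which is also what the unmodified RASampler yields.
iters_sharded = math.ceil(n_images / num_gpus / per_gpu_batch)
```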

Unfortunately, I don't think we would like to make such a modification to the RASampler, because we try to keep our implementations canonical. Moreover, I suspect that one could achieve a similar result by running the training for 4x more epochs. On the other hand, your original proposal of using the RASampler with 4 repetitions is quite elegant, as it keeps the training budget the same while still increasing the accuracy by 0.2 points. We can also use the checkpoint with accuracy 80.858, which will save us from having to exchange 1 TB worth of checkpoints.

If you send a PR with the changes described here, I can help you deploy the pre-trained model on PyTorch's S3 bucket and release it. Let me know what you think.

@tbennun (Contributor, Author) commented on Jan 13, 2022:

@datumbox agreed on all counts. If that's alright with you, I will create the PR over this weekend. Thanks again for your patience!

@datumbox (Contributor) commented:

Of course, sounds great. Looking forward to the contribution.

@tbennun (Contributor, Author) commented on Jan 16, 2022:

@datumbox Done:
Recipe and reference: #3995 (comment)
PR: #5201

4 participants