
Allow variable number of repetitions for RA #5084

Merged: 2 commits into pytorch:main on Dec 20, 2021
Conversation

@tbennun (Contributor) commented on Dec 10, 2021:

The number of repetitions in the RA sampler was hard-coded to 3. With this PR, it can be changed through CLI flags.

Related to #5051.

cc @datumbox

@facebook-github-bot commented on Dec 10, 2021:

💊 CI failures summary and remediations

As of commit 9a89f9d (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR
  • 1/2 broken upstream at merge base 1efb567 on Dec 19 from 5:50am to 5:54am

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build CodeQL / build (1/1)

Step: "Build TorchVision"

2021-12-20T08:58:36.3990086Z     self.finalize_options()
2021-12-20T08:58:36.3991080Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/develop.py", line 52, in finalize_options
2021-12-20T08:58:36.3991980Z     easy_install.finalize_options(self)
2021-12-20T08:58:36.3993030Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 263, in finalize_options
2021-12-20T08:58:36.3993834Z     self._fix_install_dir_for_user_site()
2021-12-20T08:58:36.3994885Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 375, in _fix_install_dir_for_user_site
2021-12-20T08:58:36.3995733Z     self.select_scheme(scheme_name)
2021-12-20T08:58:36.3996760Z   File "/home/runner/.local/lib/python3.8/site-packages/setuptools/command/easy_install.py", line 716, in select_scheme
2021-12-20T08:58:36.3997539Z     scheme = INSTALL_SCHEMES[name]
2021-12-20T08:58:36.3998118Z KeyError: 'unix_user'
2021-12-20T08:58:36.5460093Z ##[error]Process completed with exit code 1.
2021-12-20T08:58:36.5558109Z Post job cleanup.
2021-12-20T08:58:36.6718960Z [command]/usr/bin/git version
2021-12-20T08:58:36.6773069Z git version 2.34.1
2021-12-20T08:58:36.6805717Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2021-12-20T08:58:36.6852070Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2021-12-20T08:58:36.7200990Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2021-12-20T08:58:36.7242490Z http.https://github.com/.extraheader
2021-12-20T08:58:36.7252906Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2021-12-20T08:58:36.7295262Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :
2021-12-20T08:58:36.7721913Z Cleaning up orphan processes

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

@datumbox (Contributor) left a review comment:

Thanks @tbennun. Overall it looks like a reasonable addition. I've added a question below; let me know your thoughts.

-parser.add_argument("--ra-sampler", action="store_true", help="whether to use ra_sampler in training")
+parser.add_argument("--ra-sampler", action="store_true", help="whether to use Repeated Augmentation in training")
+parser.add_argument(
+    "--ra-reps", default=3, type=int, help="number of repetitions for Repeated Augmentation (default: 3)"
+)
@datumbox (Contributor) commented on the diff:

Thoughts on whether we should have both a boolean and an integer param vs. combining the two into a single one? Other parameters (such as --clip-grad-norm and --auto-augment) take the latter approach, defining default=None or default=0 to turn the feature off by default. Though having an extra boolean option is useful when a feature has many config parameters, it also creates noise when reading the training log to determine which inputs were actually active for the training.
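For reference, the single-flag alternative discussed here could look like this in argparse (a sketch only; the merged PR kept two flags, and the convention of 0 meaning "disabled" is illustrative, not from the codebase):

```python
import argparse

parser = argparse.ArgumentParser()
# One integer flag: 0 (the default) disables Repeated Augmentation,
# while any positive value both enables it and sets the repetition count.
parser.add_argument(
    "--ra-reps",
    default=0,
    type=int,
    help="repetitions for Repeated Augmentation (0 disables the feature)",
)

args = parser.parse_args(["--ra-reps", "4"])
use_ra = args.ra_reps > 0  # the boolean flag becomes a derived value
```

The training log then shows a single entry, ra_reps, from which both "is the feature on?" and "how many repetitions?" can be read.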

@tbennun (Contributor, Author) replied:

I completely agree, though I followed the style of the EMA flag(s). Should I switch to a more general --repeated-augmentations with default 0?

@datumbox (Contributor) replied:

Thanks for the input. :) When we introduced EMA, we offered 2 parameters, --model-ema-steps and --model-ema-decay, which is why we didn't follow this approach there.

@sallysyw Any thoughts on which approach we should choose here?

@tbennun (Contributor, Author) replied:

Makes sense. Two things to consider here:

  1. Users may not be aware of what a "good value" for the number of repetitions is. This is solvable, though: an example could be given in the argument's help text.
  2. In my fork, I'm adding another boolean flag to keep the same epoch length (rather than cutting it down by the number of repetitions) and replicate the samples within the same minibatch (and GPU), following the original Batch Augmentation paper. If it improves generalization more than the DeiT approach, would you be interested in this as part of another PR?
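The in-batch replication described in point 2 can be sketched at the index level in plain Python (the function name and layout are illustrative, not code from the fork):

```python
def batch_with_repetitions(indices, batch_size, reps):
    """Group sample indices into minibatches in which every index appears
    `reps` times consecutively; each copy would later receive an
    independent random augmentation, as in the Batch Augmentation paper."""
    assert batch_size % reps == 0, "batch size must be divisible by reps"
    unique_per_batch = batch_size // reps
    batches = []
    for start in range(0, len(indices), unique_per_batch):
        chunk = indices[start : start + unique_per_batch]
        # Repeat each index `reps` times within the same minibatch.
        batches.append([i for i in chunk for _ in range(reps)])
    return batches

# 8 unique samples, batch size 8, 4 repetitions -> 4 batches of 2 unique samples
batches = batch_with_repetitions(list(range(8)), batch_size=8, reps=4)
```

With this scheme an epoch still visits every sample once, so the epoch length is preserved, while each minibatch contains only batch_size / reps distinct images.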

@datumbox (Contributor) replied:

Concerning point 1, I agree with your arguments and I don't have strong opinions. I think your PR is mergeable as-is, but I'll let @sallysyw, who contributed the class, decide whether we should go with 2 parameters or 1.

Concerning point 2, unless the gap in accuracy is massive, I would keep things simple and leave it as is.

A contributor replied:

Sorry I missed the discussion before. I think for this specific use case, having 2 parameters for RA is better than 1 because, as @tbennun mentioned, users might not have a good sense of which ra_reps to use. So it's important to have a non-zero default (also for the backward compatibility of replicating ViT training).

@tbennun (Contributor, Author) commented on Dec 20, 2021:

@sallysyw @datumbox any news about this PR? I can confirm that with this PR (ra_reps=4) and the latest ResNet-50/ImageNet recipe, there is an improvement in top-1 accuracy to 80.802%.

@datumbox (Contributor) commented:
@tbennun Sounds good, let's merge it. Do you have the log file and checkpoint for the ResNet50 run?

@datumbox datumbox merged commit 66d57fe into pytorch:main Dec 20, 2021
@github-actions commented:
Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@tbennun (Contributor, Author) commented on Dec 20, 2021:

> @tbennun Sounds good, let's merge it. Do you have the log file and checkpoint for the ResNet50 run?

@datumbox

Checkpoints:

The script didn't store a log file, so I only have the last few printouts (would be nice if it did by default :) ). However, it did store all the models from every epoch. I attached above the relevant two (last, best), and here is the log of running with --test-only on the last 50 models: test.log

The command line I used for training is as follows (unlike the blog post, I trained on four GPUs rather than eight, but with the same overall batch size):

torchrun --nproc_per_node=4 train.py --model resnet50 --batch-size 256 --lr 0.5 --lr-scheduler cosineannealinglr --lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 --weight-decay 0.00002 --norm-weight-decay 0.0 --label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --train-crop-size 176 --model-ema --val-resize-size 232 --data-path /path/to/imagenet --cache-dataset --ra-sampler --ra-reps=4

I am now testing the mode where a full epoch is used on the same system (training will see the same number of samples as the original, but with 4x augmentations).

@datumbox (Contributor) commented:
@tbennun Thanks for sharing the checkpoints and the info. This is great!

It's a bummer you don't have the log. We aim for full reproducibility, and we normally store the logs and all checkpoints for every model we release. I'm going to kick off a new training run with --ra-sampler --ra-reps=4 on our side to see if we can reproduce it, and if we do, I'll be happy to update the checkpoint. We'll work out a way on the PR to credit you as the author (we're still trying to figure out the best way to receive such contributions from the community, so bear with me).

facebook-github-bot pushed a commit that referenced this pull request Dec 21, 2021
Summary: Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>

Reviewed By: prabhat00155

Differential Revision: D33253472

fbshipit-source-id: d193892d8cfcd8b7dbc6cc04603eb19f65c9c3e3
@datumbox (Contributor) commented on Jan 1, 2022:

@tbennun Happy new year! I was able to verify your findings and get a run which achieves:

Acc@1 80.858 Acc@5 95.434

Would you be interested in sending a PR so that you can be credited for the contribution? Since you don't have access to deploy the weights, I can do that for you. I propose the following:

  1. Write a comment on the issue "Improve the accuracy of Classification models by using SOTA recipes and primitives" #3995, briefly describing the modification on top of the main recipe. Make it self-sufficient, so that someone who reads it understands what you did to achieve the improvement. Feel free to include papers/references that inspired the change.
  2. Send a new PR which modifies the accuracies of the V2 weights to the values I mention above. Moreover, change the recipe link to point to your previous comment.
  3. Make sure the "Allow edits from maintainers" checkbox on the PR is checked, so that I can update the URL of the weights once I deploy them on S3.

@tbennun (Contributor, Author) commented on Jan 3, 2022:

@datumbox Happy new year! That's great! I'll take care of the comment and the PR. I'm currently running the other version I mentioned before (saving all logs this time :) ). Is it OK if we hold off on merging the new PR until training completes? That way we can upload a modified version if the accuracy improvement is substantial.

@datumbox (Contributor) commented on Jan 4, 2022:

@tbennun Yes, no problem at all. If the accuracy diff is substantial, we can deploy yours. Though in that case, you would need to find a way to send us all the checkpoints and log files associated with the model. This is several hundred GB, so it might be a bit tricky, but I'm sure we will work it out.

@datumbox (Contributor) commented:
@tbennun Just checking to see if you have any update concerning the training. Thanks!

@tbennun (Contributor, Author) commented on Jan 11, 2022:

> @tbennun Just checking to see if you have any update concerning the training. Thanks!

@datumbox Yes. It's almost complete (epoch 552 right now) and the max top-1 accuracy is 81.05% (at epoch 513, top-5 95.52%), but it's not getting any higher at this point. This is the exact same recipe but with full BA. I kept the log and all checkpoints as well, but you can judge whether this is something worth uploading. What do you think?

@datumbox (Contributor) commented:
@tbennun Sounds good, let's review the full details once the training is completed. Please upload somewhere the best epoch checkpoint once the training is completed along with the log so that we can check. If we do end up using your run, it's going to be a challenge fetching all the logs and checkpoints. Bear with us as we try to work these details out; you are the first person who wants to contribute a half TB run, so that comes with its own challenges. 😄

@tbennun (Contributor, Author) commented on Jan 13, 2022:

@datumbox Training complete.

Some notes:

  • Some epochs appear more than once in the log. This is because of time limits on the cluster; the run was restarted each time with --resume (20 times). Since the entire optimizer state is serialized, I think this is OK(?)
  • I have all the weights, and I think we can find a solution to transfer them (temporarily placing them on a server)
  • One issue with the code: to enable the full BA run, I exploited the fact that the number of repetitions matches the number of nodes and simply fixed the seed across the nodes. Now that I know the gains could be interesting, I can work on a proper, more elaborate PR and a flag separately. The current diff is as follows:
diff --git a/references/classification/sampler.py b/references/classification/sampler.py
index 3c5e8b01..9ae16500 100644
--- a/references/classification/sampler.py
+++ b/references/classification/sampler.py
@@ -15,7 +15,7 @@ class RASampler(torch.utils.data.Sampler):
     https://github.com/facebookresearch/deit/blob/main/samplers.py
     """

-    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3):
+    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0, repetitions=3, whole_dataset=True):
         if num_replicas is None:
             if not dist.is_available():
                 raise RuntimeError("Requires distributed package to be available!")
@@ -34,6 +34,7 @@ class RASampler(torch.utils.data.Sampler):
         self.shuffle = shuffle
         self.seed = seed
         self.repetitions = repetitions
+        self.alldata = whole_dataset

     def __iter__(self):
         # Deterministically shuffle based on epoch
@@ -44,6 +45,9 @@ class RASampler(torch.utils.data.Sampler):
         else:
             indices = list(range(len(self.dataset)))

+        if self.alldata and self.num_replicas == self.repetitions:
+            return iter(indices)
+
         # Add extra samples to make it evenly divisible
         indices = [ele for ele in indices for i in range(self.repetitions)]
         indices += indices[: (self.total_size - len(indices))]
@@ -56,7 +60,7 @@ class RASampler(torch.utils.data.Sampler):
         return iter(indices[: self.num_selected_samples])

     def __len__(self):
-        return self.num_selected_samples
+        return len(self.dataset) if self.alldata else self.num_selected_samples

     def set_epoch(self, epoch):
         self.epoch = epoch
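To make the patch's effect concrete, here is a torch-free sketch of the two index-selection paths (shuffling and RASampler's exact truncation constant are omitted, so the numbers are only approximate and the function name is illustrative):

```python
import math

def ra_indices(n, num_replicas, rank, repetitions, whole_dataset=False):
    indices = list(range(n))
    if whole_dataset and num_replicas == repetitions:
        # Patched path: every replica iterates over the full dataset,
        # so an epoch covers len(dataset) samples per GPU.
        return indices
    # Original path: repeat each index, pad so the list divides evenly
    # across replicas, shard, then keep roughly n / num_replicas samples,
    # of which only ~n / repetitions are distinct.
    repeated = [i for i in indices for _ in range(repetitions)]
    total = math.ceil(len(repeated) / num_replicas) * num_replicas
    repeated += repeated[: total - len(repeated)]
    shard = repeated[rank:total:num_replicas]
    return shard[: n // num_replicas]

full = ra_indices(8, num_replicas=4, rank=0, repetitions=4, whole_dataset=True)
part = ra_indices(8, num_replicas=4, rank=0, repetitions=4)
```

The guard num_replicas == repetitions mirrors the assumption in the diff: with a fixed seed across nodes, each node holds one of the repetitions of the whole dataset.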

@datumbox (Contributor) commented:
@tbennun Thanks for the update.

I assume that by nodes you mean the number of available GPUs. Judging from the patch you provided, I understand that each GPU actually runs over the full dataset, meaning that for ra_reps=4 it takes 4x longer to train the model. This is why, according to your log, the total number of iterations per epoch is 5005 instead of the expected 1252.
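The iteration counts quoted here follow from the run's configuration (4 GPUs, per-GPU batch size 256) and ImageNet-1k's 1,281,167 training images; rounding details of the real sampler could shift the numbers by an iteration, so this is just the back-of-the-envelope arithmetic:

```python
import math

n_images = 1_281_167   # ImageNet-1k training set size
per_gpu_batch = 256
num_gpus = 4

# Patched "whole dataset" mode: each GPU iterates over all images per epoch.
iters_whole = math.ceil(n_images / per_gpu_batch)

# Standard distributed sharding: each GPU sees ~1/num_gpus of the images per
# epoch, which is also what the unmodified RASampler yields.
iters_sharded = math.ceil(n_images / num_gpus / per_gpu_batch)
```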

Unfortunately, I don't think we would like to make such a modification to the RASampler, because we try to keep our implementations canonical. Moreover, I suspect that one could achieve a similar result by running the training for 4x more epochs. On the other hand, your original proposal of using the RASampler with 4 repetitions is quite elegant, as it keeps the training budget the same while still increasing the accuracy by 0.2 points. We can also use the checkpoint with accuracy 80.858, which will save us from having to exchange 1 TB worth of checkpoints.

If you send a PR with the changes described here, I can help you deploy the pre-trained model on PyTorch's S3 bucket and release it. Let me know what you think.

@tbennun (Contributor, Author) commented on Jan 13, 2022:

@datumbox agreed on all counts. If that's alright with you, I will create the PR over this weekend. Thanks again for your patience!

@datumbox (Contributor) commented:

Of course, sounds great. Looking forward to the contribution.

@tbennun (Contributor, Author) commented on Jan 16, 2022:

@datumbox Done:
Recipe and reference: #3995 (comment)
PR: #5201

4 participants