diff --git a/references/video_classification/README.md b/references/video_classification/README.md
index c387e2e7158..9bd1b9cc285 100644
--- a/references/video_classification/README.md
+++ b/references/video_classification/README.md
@@ -18,11 +18,11 @@ We assume the training and validation AVI videos are stored at `/data/kinectics4

 Run the training on a single node with 8 GPUs:
 ```bash
-torchrun --nproc_per_node=8 train.py --data-path=/data/kinectics400 --kinetics-version="400" --batch-size=16 --cache-dataset --sync-bn --amp
+torchrun --nproc_per_node=8 train.py --data-path=/data/kinectics400 --kinetics-version="400" --lr 0.08 --cache-dataset --sync-bn --amp
 ```

 **Note:** all our models were trained on 8 nodes with 8 V100 GPUs each for a total of 64 GPUs. Expected training time for 64 GPUs is 24 hours, depending on the storage solution.

-**Note 2:** hyperparameters for exact replication of our training can be found [here](https://github.com/pytorch/vision/blob/main/torchvision/models/video/README.md). Some hyperparameters such as learning rate are scaled linearly in proportion to the number of GPUs.
+**Note 2:** hyperparameters for exact replication of our training can be found in the section below. Some hyperparameters, such as the learning rate, must be scaled linearly in proportion to the number of GPUs. The default values assume 64 GPUs.

 ### Single GPU

@@ -40,3 +40,70 @@ Since the original release, additional versions of Kinetics dataset became avail
 Our training scripts support these versions of dataset as well by setting the `--kinetics-version` parameter to `"600"`.

 **Note:** training on Kinetics 600 requires a different set of hyperparameters for optimal performance. We do not provide Kinetics 600 pretrained models.
+
+
+## Video classification models
+
+Starting with version `0.4.0`, we have introduced support for basic video tasks and video classification modelling.
+For more information about the available models, check [here](https://pytorch.org/docs/stable/torchvision/models.html#video-classification).
+
+### Video ResNet models
+
+See the reference training script [here](https://github.com/pytorch/vision/blob/main/references/video_classification/train.py):
+
+- input space: RGB
+- resize size: [128, 171]
+- crop size: [112, 112]
+- mean: [0.43216, 0.394666, 0.37645]
+- std: [0.22803, 0.22145, 0.216989]
+- number of classes: 400
+
+Input data augmentations at training time (with optional parameters):
+
+1. ConvertImageDtype
+2. Resize (resize size value above)
+3. Random horizontal flip (0.5)
+4. Normalization (mean, std, see values above)
+5. Random Crop (crop size value above)
+6. Convert BCHW to CBHW
+
+Input data augmentations at validation time (with optional parameters):
+
+1. ConvertImageDtype
+2. Resize (resize size value above)
+3. Normalization (mean, std, see values above)
+4. Center Crop (crop size value above)
+5. Convert BCHW to CBHW
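+
+For illustration, here is a minimal, editorial sketch of how the training-time transforms listed above could be composed with `torchvision.transforms`; the project's actual `presets.py` may differ in details, and the `ConvertBCHWtoCBHW` class below is a stand-in rather than the reference implementation:
+
+```python
+import torch
+from torchvision import transforms
+
+
+class ConvertBCHWtoCBHW(torch.nn.Module):
+    # Stand-in for the reference transform: move the temporal dimension so a
+    # clip of shape (T, C, H, W) becomes (C, T, H, W).
+    def forward(self, clip: torch.Tensor) -> torch.Tensor:
+        return clip.permute(1, 0, 2, 3)
+
+
+mean = (0.43216, 0.394666, 0.37645)
+std = (0.22803, 0.22145, 0.216989)
+
+# Training-time pipeline mirroring the numbered list above, in the same order.
+train_transform = transforms.Compose(
+    [
+        transforms.ConvertImageDtype(torch.float32),
+        transforms.Resize((128, 171)),
+        transforms.RandomHorizontalFlip(p=0.5),
+        transforms.Normalize(mean=mean, std=std),
+        transforms.RandomCrop((112, 112)),
+        ConvertBCHWtoCBHW(),
+    ]
+)
+
+# A dummy 16-frame RGB clip in (T, C, H, W) uint8 layout.
+clip = torch.randint(0, 256, (16, 3, 180, 240), dtype=torch.uint8)
+print(train_transform(clip).shape)  # torch.Size([3, 16, 112, 112])
+```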
+
+This translates into the following set of command-line arguments. Please note that the `--batch-size` parameter controls the
+batch size per GPU. Moreover, note that our default `--lr` is configured for 64 GPUs, which is how many we used for the
+video ResNet models:
+```
+# number of frames per clip
+--clip_len 16 \
+# allow for temporal jittering
+--clips_per_video 5 \
+--batch-size 24 \
+--epochs 45 \
+--lr 0.64 \
+# we use 10 epochs for linear warmup
+--lr-warmup-epochs 10 \
+# learning rate is decayed at epochs 20, 30 and 40 by a factor of 10
+--lr-milestones 20 30 40 \
+--lr-gamma 0.1 \
+--train-resize-size 128 171 \
+--train-crop-size 112 112 \
+--val-resize-size 128 171 \
+--val-crop-size 112 112
+```
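+
+As a quick, editorial illustration of the linear scaling rule mentioned above (not part of the training script): with the default `--lr 0.64` configured for 64 GPUs, running on `N` GPUs means passing `0.64 * N / 64`, which is where the `--lr 0.08` in the single-node, 8-GPU `torchrun` example above comes from:
+
+```python
+def scaled_lr(num_gpus: int, base_lr: float = 0.64, base_gpus: int = 64) -> float:
+    # Linear scaling relative to the 64-GPU default.
+    return base_lr * num_gpus / base_gpus
+
+
+print(scaled_lr(64))  # 0.64 -> the default, 8 nodes x 8 V100 GPUs
+print(scaled_lr(8))   # 0.08 -> matches the single-node, 8-GPU torchrun example
+print(scaled_lr(1))   # 0.01 -> a single GPU
+```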
help="initial learning rate") parser.add_argument("--momentum", default=0.9, type=float, metavar="M", help="momentum") parser.add_argument( "--wd", @@ -400,6 +404,35 @@ def get_args_parser(add_help=True): parser.add_argument("--world-size", default=1, type=int, help="number of distributed processes") parser.add_argument("--dist-url", default="env://", type=str, help="url used to set up distributed training") + parser.add_argument( + "--val-resize-size", + default=(128, 171), + nargs="+", + type=int, + help="the resize size used for validation (default: (128, 171))", + ) + parser.add_argument( + "--val-crop-size", + default=(112, 112), + nargs="+", + type=int, + help="the central crop size used for validation (default: (112, 112))", + ) + parser.add_argument( + "--train-resize-size", + default=(128, 171), + nargs="+", + type=int, + help="the resize size used for training (default: (128, 171))", + ) + parser.add_argument( + "--train-crop-size", + default=(112, 112), + nargs="+", + type=int, + help="the random crop size used for training (default: (112, 112))", + ) + parser.add_argument("--weights", default=None, type=str, help="the weights enum name to load") # Mixed precision training parameters diff --git a/torchvision/models/video/README.md b/torchvision/models/video/README.md deleted file mode 100644 index 1024534f546..00000000000 --- a/torchvision/models/video/README.md +++ /dev/null @@ -1,60 +0,0 @@ -## Video classification models - -Starting with version `0.4.0` we have introduced support for basic video tasks and video classification modelling. -At the moment, our pretraining consists of base implementation of popular resnet-based video models [0], together with their -basic variant pre-trained on Kinetics400 [1]. Although this is a standard benchmark pre-training, we are always considering what is the best for the community. - -Additional documentation can be found [here](https://pytorch.org/docs/stable/torchvision/models.html#video-classification). - -### Kinetics400 dataset pretraining parameters - -See reference training script [here](https://github.com/pytorch/vision/blob/main/references/video_classification/train.py): - -- input size: [3, 16, 112, 112] -- input space: RGB -- input range: [0, 1] -- mean: [0.43216, 0.394666, 0.37645] -- std: [0.22803, 0.22145, 0.216989] -- number of classes: 400 - -Input data augmentations at training time (with optional parameters): - -0. ToTensor -1. Resize (128, 171) -2. Random horizontal flip (0.5) -3. Normalization (mean, std, see values above) -4. Random Crop (112, 112) - -Input data augmentations at validation time (with optional parameters): - -0. ToTensor -1. Resize (128, 171) -2. Normalization (mean, std, see values above) -3. Center Crop (112, 112) - -This translates in the following set of command-line arguments (please note that learning rate and batch size end up being scaled by the number of GPUs; all our models were trained on 8 nodes with 8 V100 GPUs each for a total of 64 GPUs): -``` -# number of frames per clip ---clip_len 16 \ -# allow for temporal jittering ---clips_per_video 5 \ ---batch-size 24 \ ---epochs 45 \ ---lr 0.01 \ -# we use 10 epochs for linear warmup ---lr-warmup-epochs 10 \ -# learning rate is decayed at 20, 30, and 40 epoch by a factor of 10 ---lr-milestones 20, 30, 40 \ ---lr-gamma 0.1 -``` - -### Additional video modelling resources - -- [Video Model Zoo](https://github.com/facebookresearch/VMZ) -- [PySlowFast](https://github.com/facebookresearch/SlowFast) - -### References - -[0] _D. Tran, H. Wang, L. 
diff --git a/torchvision/models/video/README.md b/torchvision/models/video/README.md
deleted file mode 100644
index 1024534f546..00000000000
--- a/torchvision/models/video/README.md
+++ /dev/null
@@ -1,60 +0,0 @@
-## Video classification models
-
-Starting with version `0.4.0` we have introduced support for basic video tasks and video classification modelling.
-At the moment, our pretraining consists of base implementation of popular resnet-based video models [0], together with their
-basic variant pre-trained on Kinetics400 [1]. Although this is a standard benchmark pre-training, we are always considering what is the best for the community.
-
-Additional documentation can be found [here](https://pytorch.org/docs/stable/torchvision/models.html#video-classification).
-
-### Kinetics400 dataset pretraining parameters
-
-See reference training script [here](https://github.com/pytorch/vision/blob/main/references/video_classification/train.py):
-
-- input size: [3, 16, 112, 112]
-- input space: RGB
-- input range: [0, 1]
-- mean: [0.43216, 0.394666, 0.37645]
-- std: [0.22803, 0.22145, 0.216989]
-- number of classes: 400
-
-Input data augmentations at training time (with optional parameters):
-
-0. ToTensor
-1. Resize (128, 171)
-2. Random horizontal flip (0.5)
-3. Normalization (mean, std, see values above)
-4. Random Crop (112, 112)
-
-Input data augmentations at validation time (with optional parameters):
-
-0. ToTensor
-1. Resize (128, 171)
-2. Normalization (mean, std, see values above)
-3. Center Crop (112, 112)
-
-This translates in the following set of command-line arguments (please note that learning rate and batch size end up being scaled by the number of GPUs; all our models were trained on 8 nodes with 8 V100 GPUs each for a total of 64 GPUs):
-```
-# number of frames per clip
---clip_len 16 \
-# allow for temporal jittering
---clips_per_video 5 \
---batch-size 24 \
---epochs 45 \
---lr 0.01 \
-# we use 10 epochs for linear warmup
---lr-warmup-epochs 10 \
-# learning rate is decayed at 20, 30, and 40 epoch by a factor of 10
---lr-milestones 20, 30, 40 \
---lr-gamma 0.1
-```
-
-### Additional video modelling resources
-
-- [Video Model Zoo](https://github.com/facebookresearch/VMZ)
-- [PySlowFast](https://github.com/facebookresearch/SlowFast)
-
-### References
-
-[0] _D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri_: A Closer Look at Spatiotemporal Convolutions for Action Recognition. _CVPR 2018_ ([paper](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf))
-
-[1] _W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman_: The Kinetics Human Action Video Dataset ([paper](https://arxiv.org/abs/1705.06950))