diff --git a/docs/tutorial/inference.md b/docs/tutorial/inference.md
index b135a83a3..dbeda7d93 100644
--- a/docs/tutorial/inference.md
+++ b/docs/tutorial/inference.md
@@ -14,7 +14,7 @@ There are two ways to do inference during training.
    "evaluate some tensors for each input, and aggregate the results in the end".
    You can use the `InferenceRunner` interface with some `Inferencer`.
    This will further support prefetch & data-parallel inference.
-   
+
    Currently this lacks documentation, but you can refer to examples
    that uses `InferenceRunner` or custom `Inferencer` to learn more.
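+
+   For instance, a minimal sketch (illustrative only; `MyModel`, `train_df`
+   and `val_df` are placeholders for your own model and dataflows, and the
+   tensor names must match the ones defined in your graph):
+
+   ```python
+   from tensorpack import (TrainConfig, InferenceRunner,
+                           ScalarStats, ClassificationError)
+
+   config = TrainConfig(
+       model=MyModel(),
+       dataflow=train_df,
+       callbacks=[
+           # run every Inferencer over val_df after each epoch:
+           # log the average 'cost' and the error rate from 'wrong-top1'
+           InferenceRunner(val_df,
+                           [ScalarStats('cost'),
+                            ClassificationError('wrong-top1')]),
+       ],
+   )
+   ```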
@@ -55,8 +55,8 @@ predictor = OfflinePredictor(pred_config)
 output1_array, output2_array = predictor(input1_array, input2_array)
 ```
 
-It's __common to use a different graph for inference__, 
-e.g., use NHWC format, support encoded image format, etc. 
+It's __common to use a different graph for inference__,
+e.g., use NHWC format, support encoded image format, etc.
 You can make these changes inside the `model` or `tower_func` in your `PredictConfig`.
 The example in [examples/basics/export-model.py](../examples/basics/export-model.py) demonstrates such an altered inference graph.
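+
+As an illustration (`MyInferenceModel` and the tensor names below are
+placeholders, not part of this example), such a config might look like:
+
+```python
+from tensorpack import PredictConfig, OfflinePredictor, get_model_loader
+
+pred_config = PredictConfig(
+    model=MyInferenceModel(),   # may build a different graph than training
+    session_init=get_model_loader('/path/to/checkpoint'),
+    input_names=['input'],      # tensor names in the inference graph
+    output_names=['prob'])
+predictor = OfflinePredictor(pred_config)
+prob = predictor(input_array)[0]   # returns a list, one array per output name
+```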
@@ -90,7 +90,7 @@ you can also save your models into other formats after training, so it may be mo
   - Removes all unnecessary operations (training-only ops, e.g., learning-rate) to compress the graph.
 
   This creates a self-contained graph which includes all necessary information to run inference.
-  
+
  To load the saved graph, you can simply:
  ```python
  graph_def = tf.GraphDef()
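+ # a possible continuation of the snippet (sketch only;
+ # `model_path` is a placeholder for your frozen-graph file):
+ graph_def.ParseFromString(open(model_path, 'rb').read())
+ tf.import_graph_def(graph_def)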
@@ -116,7 +116,7 @@ training:
 
 1. The model (the graph): you've already written it yourself with TF symbolic functions.
    Nothing about it is related to the tensorpack interface.
-   If you use tensorpack layers, they are mainly just wrappers around `tf.layers`.
+   If you use tensorpack layers, they are not so different from `tf.layers`.
 
 2. The trained parameters: tensorpack saves them in standard TF checkpoint format.
    Nothing about the format is related to tensorpack.
@@ -139,14 +139,16 @@ with TowerContext('', is_training=False):
 ```eval_rst
 .. note:: **Do not use metagraph for inference!**
 
-   Metagraph is the wrong abstraction for a "model".
+   Tensorpack saves a metagraph during training. Users should not try to load it for inference.
+
+   Metagraph is the wrong abstraction for a "model".
    It stores the entire graph which contains not only the mathematical model, but also all the
    training settings (queues, iterators, summaries, evaluations, multi-gpu replications).
    Therefore it is usually wrong to import a training metagraph for inference.
 
-   It's especially error-prone to load a metagraph on top of a non-empty graph. 
-   The potential name conflicts between the current graph and the nodes in the 
-   metagraph can lead to esoteric bugs or sometimes completely ruin the model. 
+   It's especially error-prone to load a metagraph on top of a non-empty graph.
+   The potential name conflicts between the current graph and the nodes in the
+   metagraph can lead to esoteric bugs or sometimes completely ruin the model.
 
    It's also very common to change the graph for inference.
    For example, you may need a different data layout for CPU inference,
@@ -161,7 +163,7 @@ with TowerContext('', is_training=False):
 You can just use `tf.train.Saver` for all the work.
 Alternatively, use tensorpack's `get_model_loader(path).init(tf.get_default_session())`
 
-Now, you've already built a graph for inference, and the checkpoint is also loaded. 
+Now, you've already built a graph for inference, and the checkpoint is also loaded.
 You may now:
 
 1. use `sess.run` to do inference
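+
+   For example, a bare-TF sketch (illustrative only; `MyModel`, the tensor
+   names and `batch_of_images` are placeholders for your own graph and data):
+
+   ```python
+   import tensorflow as tf
+   from tensorpack.tfutils.tower import TowerContext
+   from tensorpack.tfutils.sessinit import get_model_loader
+
+   image = tf.placeholder(tf.float32, [None, 224, 224, 3], name='input')
+   with TowerContext('', is_training=False):
+       MyModel().build_graph(image)      # build the inference graph
+   prob = tf.get_default_graph().get_tensor_by_name('prob:0')
+
+   with tf.Session() as sess:
+       get_model_loader('/path/to/checkpoint').init(sess)
+       print(sess.run(prob, feed_dict={image: batch_of_images}))
+   ```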
diff --git a/examples/A3C-Gym/README.md b/examples/A3C-Gym/README.md
index ae5b0af44..6586666ef 100644
--- a/examples/A3C-Gym/README.md
+++ b/examples/A3C-Gym/README.md
@@ -28,7 +28,7 @@ Some practicical notes:
 
 ### To test a model:
 
-Download models from [model zoo](http://models.tensorpack.com/OpenAIGym/).
+Download models from [model zoo](http://models.tensorpack.com/#OpenAIGym).
 
 Watch the agent play:
 `./train-atari.py --task play --env Breakout-v0 --load Breakout-v0.npz`
diff --git a/examples/CaffeModels/README.md b/examples/CaffeModels/README.md
index c847ec77a..2e5a9cf0a 100644
--- a/examples/CaffeModels/README.md
+++ b/examples/CaffeModels/README.md
@@ -1,8 +1,7 @@
 Example code to convert, load and run inference of some Caffe models.
 Require caffe python bindings to be installed.
 
-Converted models can also be found at [tensorpack model zoo](http://models.tensorpack.com).
-
+Converted models can also be found at [tensorpack model zoo](http://models.tensorpack.com/#Caffe-Converted).
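+
+A converted model is a numpy dictionary of parameter names to values, so it
+can be loaded without caffe. A sketch (the file name is a placeholder):
+
+```python
+import numpy as np
+from tensorpack.tfutils.sessinit import DictRestore
+
+# use as the session_init of a TrainConfig / PredictConfig
+session_init = DictRestore(dict(np.load('alexnet.npz')))
+```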
 
 ## AlexNet:
 Download: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
diff --git a/examples/DoReFa-Net/README.md b/examples/DoReFa-Net/README.md
index 8f0e46faf..abc52f8fe 100644
--- a/examples/DoReFa-Net/README.md
+++ b/examples/DoReFa-Net/README.md
@@ -46,7 +46,7 @@ In this implementation, quantized operations are all performed through `tf.float
 + Look at the docstring in `*-dorefa.py` to see detailed usage and performance.
 
 Pretrained model for (1,4,32)-ResNet18 and several AlexNet are available at
-[tensorpack model zoo](http://models.tensorpack.com/DoReFa-Net/).
+[tensorpack model zoo](http://models.tensorpack.com/#DoReFa-Net).
 They're provided in the format of numpy dictionary.
 The __binary-weight 4-bit-activation ResNet-18__ model has 59.2% top-1 validation accuracy.
diff --git a/examples/FasterRCNN/README.md b/examples/FasterRCNN/README.md
index fb2b8fefb..792b69ecd 100644
--- a/examples/FasterRCNN/README.md
+++ b/examples/FasterRCNN/README.md
@@ -18,8 +18,8 @@ This is likely the best-performing open source TensorFlow reimplementation of th
 ## Dependencies
 + Python 3.3+; OpenCV
 + TensorFlow ≥ 1.6
-+ pycocotools: `for i in cython 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'; do pip install $i; done`
-+ Pre-trained [ImageNet ResNet model](http://models.tensorpack.com/FasterRCNN/)
++ pycocotools/scipy: `for i in cython 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI' scipy; do pip install $i; done`
++ Pre-trained [ImageNet ResNet model](http://models.tensorpack.com/#FasterRCNN)
   from tensorpack model zoo
 + [COCO data](http://cocodataset.org/#download). It needs to have the following directory structure:
 ```
@@ -83,24 +83,24 @@ prediction have to be run with the corresponding configs used in training.
 These models are trained on train2017 and evaluated on val2017 using mAP@IoU=0.50:0.95.
 Unless otherwise noted, all models are fine-tuned from ImageNet pre-trained R50/R101 models in
-[tensorpack model zoo](http://models.tensorpack.com/FasterRCNN/),
+[tensorpack model zoo](http://models.tensorpack.com/#FasterRCNN),
 using 8 NVIDIA V100s.
 Performance in [Detectron](https://github.com/facebookresearch/Detectron/) can be reproduced.
 
 | Backbone | mAP<br/>(box;mask) | Detectron mAP <sup>[1](#ft1)</sup><br/>(box;mask) | Time <br/>(on 8 V100s) | Configurations <br/>(click to expand) |
 | - | - | - | - | - |
-| R50-C4 | 34.1 | | 7.5h | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[140000,180000,200000]` </details> |
-| R50-C4 | 35.6 | 34.8 | 23h | <details><summary>standard</summary>`MODE_MASK=False` </details> |
-| R50-FPN | 37.5 | 36.7 | 11h | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details> |
-| R50-C4 | 36.2;31.8 [:arrow_down:][R50C41x] | 35.8;31.4 | 23.5h | <details><summary>standard</summary>this is the default, no changes in config needed </details> |
-| R50-FPN | 38.2;34.8 | 37.7;33.9 | 13.5h | <details><summary>standard</summary>`MODE_FPN=True` </details> |
-| R50-FPN | 38.9;35.4 [:arrow_down:][R50FPN2x] | 38.6;34.5 | 25h | <details><summary>2x</summary>`MODE_FPN=True`<br/>`TRAIN.LR_SCHEDULE=2x` </details> |
-| R50-FPN-GN | 40.4;36.3 [:arrow_down:][R50FPN2xGN] | 40.3;35.7 | 31h | <details><summary>2x+GN</summary>`MODE_FPN=True`<br/>`FPN.NORM=GN BACKBONE.NORM=GN`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`<br/>`FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head`<br/>`TRAIN.LR_SCHEDULE=2x` </details> |
-| R50-FPN | 41.7;36.2 | | 17h | <details><summary>+Cascade</summary>`MODE_FPN=True FPN.CASCADE=True` </details> |
-| R101-C4 | 40.1;34.6 [:arrow_down:][R101C41x] | | 28h | <details><summary>standard</summary>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
-| R101-FPN | 40.7;36.8 [:arrow_down:][R101FPN1x] | 40.0;35.9 | 18h | <details><summary>standard</summary>`MODE_FPN=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
-| R101-FPN | 46.6;40.3 [:arrow_down:][R101FPN3xCasAug] <sup>[2](#ft2)</sup> | | 69h | <details><summary>3x+Cascade+TrainAug</summary>`MODE_FPN=True FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`TEST.RESULT_SCORE_THRESH=1e-4`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=3x` </details> |
+| R50-C4 | 34.1 | | 7h | <details><summary>super quick</summary>`MODE_MASK=False FRCNN.BATCH_PER_IM=64`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024`<br/>`TRAIN.LR_SCHEDULE=[140000,180000,200000]` </details> |
+| R50-C4 | 35.6 | 34.8 | 22.5h | <details><summary>standard</summary>`MODE_MASK=False` </details> |
+| R50-FPN | 37.5 | 36.7 | 10.5h | <details><summary>standard</summary>`MODE_MASK=False MODE_FPN=True` </details> |
+| R50-C4 | 36.2;31.8 [:arrow_down:][R50C41x] | 35.8;31.4 | 23h | <details><summary>standard</summary>this is the default, no changes in config needed </details> |
+| R50-FPN | 38.2;34.8 | 37.7;33.9 | 12.5h | <details><summary>standard</summary>`MODE_FPN=True` </details> |
+| R50-FPN | 38.9;35.4 [:arrow_down:][R50FPN2x] | 38.6;34.5 | 24h | <details><summary>2x</summary>`MODE_FPN=True`<br/>`TRAIN.LR_SCHEDULE=2x` </details> |
+| R50-FPN-GN | 40.4;36.3 [:arrow_down:][R50FPN2xGN] | 40.3;35.7 | 29h | <details><summary>2x+GN</summary>`MODE_FPN=True`<br/>`FPN.NORM=GN BACKBONE.NORM=GN`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`<br/>`FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head`<br/>`TRAIN.LR_SCHEDULE=2x` </details> |
+| R50-FPN | 41.7;36.2 | | 16h | <details><summary>+Cascade</summary>`MODE_FPN=True FPN.CASCADE=True` </details> |
+| R101-C4 | 40.1;34.6 [:arrow_down:][R101C41x] | | 27h | <details><summary>standard</summary>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
+| R101-FPN | 40.7;36.8 [:arrow_down:][R101FPN1x] | 40.0;35.9 | 17h | <details><summary>standard</summary>`MODE_FPN=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]` </details> |
+| R101-FPN | 46.6;40.3 [:arrow_down:][R101FPN3xCasAug] <sup>[2](#ft2)</sup> | | 64h | <details><summary>3x+Cascade+TrainAug</summary>`MODE_FPN=True FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`TEST.RESULT_SCORE_THRESH=1e-4`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=3x` </details> |
 | R101-FPN-GN<br/>(From Scratch) | 47.7;41.7 [:arrow_down:][R101FPN9xGNCasAugScratch] <sup>[3](#ft3)</sup> | 47.4;40.5 | 28h (on 64 V100s) | <details><summary>9x+GN+Cascade+TrainAug</summary>`MODE_FPN=True FPN.CASCADE=True`<br/>`BACKBONE.RESNET_NUM_BLOCKS=[3,4,23,3]`<br/>`FPN.NORM=GN BACKBONE.NORM=GN`<br/>`FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`<br/>`FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head`<br/>`PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800]`<br/>`TRAIN.LR_SCHEDULE=9x`<br/>`BACKBONE.FREEZE_AT=0` </details> |
 
 [R50C41x]: http://models.tensorpack.com/FasterRCNN/COCO-MaskRCNN-R50C41x.npz
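+
+The configurations in the table are command-line overrides, passed to the
+training script as e.g. `./train.py --config MODE_FPN=True TRAIN.LR_SCHEDULE=2x`.
+A programmatic sketch of the same overrides (assuming the `update_args`
+helper defined in this example's `config.py`):
+
+```python
+from config import config as cfg
+
+# each entry is a 'KEY=VALUE' string, keys as in the table above
+cfg.update_args(['MODE_FPN=True', 'TRAIN.LR_SCHEDULE=2x'])
+```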
diff --git a/examples/FasterRCNN/config.py b/examples/FasterRCNN/config.py
index 70ad7ae3f..508106c40 100644
--- a/examples/FasterRCNN/config.py
+++ b/examples/FasterRCNN/config.py
@@ -140,7 +140,7 @@ def __ne__(self, _):
 # Therefore, there is *no need* to modify the config if you only change the number of GPUs.
 _C.TRAIN.LR_SCHEDULE = "1x"      # "1x" schedule in detectron
-_C.TRAIN.EVAL_PERIOD = 25  # period (epochs) to run evaluation
+_C.TRAIN.EVAL_PERIOD = 50  # period (epochs) to run evaluation
 _C.TRAIN.CHECKPOINT_PERIOD = 20  # period (epochs) to save model
 
 # preprocessing --------------------
diff --git a/examples/GAN/BEGAN.py b/examples/GAN/BEGAN.py
index d0ade4d21..2a76285dc 100755
--- a/examples/GAN/BEGAN.py
+++ b/examples/GAN/BEGAN.py
@@ -17,7 +17,7 @@
 Boundary Equilibrium GAN.
 See the docstring in DCGAN.py for usage.
 
-A pretrained model on CelebA is at http://models.tensorpack.com/GAN/
+A pretrained model on CelebA is at http://models.tensorpack.com/#GAN
 """
diff --git a/examples/GAN/DCGAN.py b/examples/GAN/DCGAN.py
index a558e645c..56a43aa74 100755
--- a/examples/GAN/DCGAN.py
+++ b/examples/GAN/DCGAN.py
@@ -30,7 +30,7 @@
 You can also train on other images (just use any directory of jpg files in
 `--data`). But you may need to change the preprocessing.
 
-A pretrained model on CelebA is at http://models.tensorpack.com/GAN/
+A pretrained model on CelebA is at http://models.tensorpack.com/#GAN
 """
diff --git a/examples/GAN/InfoGAN-mnist.py b/examples/GAN/InfoGAN-mnist.py
index 6302850ba..1b993a7c9 100755
--- a/examples/GAN/InfoGAN-mnist.py
+++ b/examples/GAN/InfoGAN-mnist.py
@@ -24,7 +24,7 @@
 To visualize:
     ./InfoGAN-mnist.py --sample --load path/to/model
 
-A pretrained model is at http://models.tensorpack.com/GAN/
+A pretrained model is at http://models.tensorpack.com/#GAN
 """
 
 BATCH = 128
diff --git a/examples/HED/README.md b/examples/HED/README.md
index ed415496d..cc51fdc9f 100644
--- a/examples/HED/README.md
+++ b/examples/HED/README.md
@@ -33,4 +33,4 @@ To inference (produce a heatmap at each level at out*.png):
 ```bash
 ./hed.py --load pretrained.model --run a.jpg
 ```
-Models I trained can be downloaded [here](http://models.tensorpack.com/HED/).
+Models I trained can be downloaded [here](http://models.tensorpack.com/#HED).
diff --git a/examples/ImageNetModels/README.md b/examples/ImageNetModels/README.md
index 3ad4efb61..a6f070a1f 100644
--- a/examples/ImageNetModels/README.md
+++ b/examples/ImageNetModels/README.md
@@ -4,7 +4,7 @@ ImageNet training code of ResNet, ShuffleNet, DoReFa-Net, AlexNet, Inception, VG
 To train any of the models, just do `./{model}.py --data /path/to/ilsvrc`.
 More options are available in `./{model}.py --help`.
 Expected format of data directory is described in [docs](http://tensorpack.readthedocs.io/modules/dataflow.dataset.html#tensorpack.dataflow.dataset.ILSVRC12).
-Some pretrained models can be downloaded at [tensorpack model zoo](http://models.tensorpack.com/).
+Some pretrained models can be downloaded at [tensorpack model zoo](http://models.tensorpack.com/#ImageNetModels).
 
 ### ShuffleNet
diff --git a/examples/Saliency/README.md b/examples/Saliency/README.md
index 8e33f334a..30eb48b6c 100644
--- a/examples/Saliency/README.md
+++ b/examples/Saliency/README.md
@@ -39,7 +39,7 @@ Usage:
    ./CAM-resnet.py --data /path/to/imagenet [--load ImageNet-ResNet18-Preact.npz] [--gpu 0,1,2,3]
    ```
    Pretrained and fine-tuned ResNet can be downloaded
-   in the [model zoo](http://models.tensorpack.com/). 
+   in the [model zoo](http://models.tensorpack.com/#Visualization).
 
 2. Generate CAM on ImageNet validation set:
    ```bash
diff --git a/examples/SpatialTransformer/README.md b/examples/SpatialTransformer/README.md
index 275bf9008..484e3dcc1 100644
--- a/examples/SpatialTransformer/README.md
+++ b/examples/SpatialTransformer/README.md
@@ -20,7 +20,7 @@ To train (takes about 300 epochs to reach 8.8% error):
 ./mnist-addition.py
 ```
 
-To draw the above visualization with [pretrained model](http://models.tensorpack.com/SpatialTransformer/):
+To draw the above visualization with [pretrained model](http://models.tensorpack.com/#SpatialTransformer):
 ```bash
 ./mnist-addition.py --load mnist-addition.npz --view
 ```
diff --git a/examples/SuperResolution/README.md b/examples/SuperResolution/README.md
index c1137d2b6..96c74e587 100644
--- a/examples/SuperResolution/README.md
+++ b/examples/SuperResolution/README.md
@@ -35,7 +35,7 @@ python enet-pat.py --vgg19 /path/to/vgg19.npz --data train2017.lmdb
 
 Training is highly unstable and does not often give good results.
 The pretrained model may also fail on different types of images.
-You can download and play with the pretrained model [here](http://models.tensorpack.com/SuperResolution/).
+You can download and play with the pretrained model [here](http://models.tensorpack.com/#SuperResolution).
 
 3. Inference on an image and output in current directory:
diff --git a/tensorpack/graph_builder/training.py b/tensorpack/graph_builder/training.py
index a22be6484..a7280f84b 100644
--- a/tensorpack/graph_builder/training.py
+++ b/tensorpack/graph_builder/training.py
@@ -304,7 +304,7 @@ def build(self, grad_list, get_opt_fn):
             with tf.name_scope('sync_variables'):
                 post_init_op = SyncMultiGPUReplicatedBuilder.get_post_init_ops()
         else:
-            post_init_op = tf.no_op(name='empty_sync_variables')
+            post_init_op = None
         return train_op, post_init_op
 
     # Adopt from https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/variable_mgr.py
diff --git a/tensorpack/train/trainers.py b/tensorpack/train/trainers.py
index 476590fe5..0afd52f48 100644
--- a/tensorpack/train/trainers.py
+++ b/tensorpack/train/trainers.py
@@ -190,13 +190,16 @@ def _setup_graph(self, input, get_cost_fn, get_opt_fn):
         grad_list = self._builder.call_for_each_tower(tower_fn)
         self.train_op, post_init_op = self._builder.build(grad_list, get_opt_fn)
 
-        cb = RunOp(
-            post_init_op,
-            run_before=True,
-            run_as_trigger=self.BROADCAST_EVERY_EPOCH,
-            verbose=True)
-        cb.name_scope = "SyncVariables"
-        return [cb]
+        if post_init_op is not None:
+            cb = RunOp(
+                post_init_op,
+                run_before=True,
+                run_as_trigger=self.BROADCAST_EVERY_EPOCH,
+                verbose=True)
+            cb.name_scope = "SyncVariables"
+            return [cb]
+        else:
+            return []
 
 
 class DistributedTrainerBase(SingleCostTrainer):