Skip to content
Branch: master
Find file History
tensorflower-gardener Using better version of l2 loss to avoid reshape, concat and split ops.
Also adding support for CTL mode in the borg file.

PiperOrigin-RevId: 284075404
Latest commit c0bdb37 Dec 6, 2019

README.md

Image Classification

This folder contains the TF 2.0 model examples for image classification:

For more information about other types of models, please refer to this README file.

ResNet

Similar to the estimator implementation, the Keras implementation has code for both CIFAR-10 data and ImageNet data. The CIFAR-10 version uses a ResNet56 model implemented in resnet_cifar_model.py, and the ImageNet version uses a ResNet50 model implemented in resnet_model.py.

To use either dataset, make sure that you have the latest version of TensorFlow installed and add the models folder to your Python path.

Pretrained Models

CIFAR-10

Download and extract the CIFAR-10 data. You can use the following script:

python ../../r1/resnet/cifar10_download_and_extract.py

After you download the data, you can run the program by:

python resnet_cifar_main.py

If you did not use the default directory to download the data, specify the location with the --data_dir flag, like:

python resnet_cifar_main.py --data_dir=/path/to/cifar

ImageNet Training

Download the ImageNet dataset and convert it to TFRecord format. The following script and README provide a few options.

Once your dataset is ready, you can begin training the model as follows:

python resnet_imagenet_main.py

Again, if you did not download the data to the default directory, specify the location with the --data_dir flag:

python resnet_imagenet_main.py --data_dir=/path/to/imagenet

There are more flag options you can specify. Here are some examples:

  • --use_synthetic_data: when set to true, synthetic data, rather than real data, are used;
  • --batch_size: the batch size used for the model;
  • --model_dir: the directory to save the model checkpoint;
  • --train_epochs: number of epoches to run for training the model;
  • --train_steps: number of steps to run for training the model. We now only support a number that is smaller than the number of batches in an epoch.
  • --skip_eval: when set to true, evaluation as well as validation during training is skipped

For example, this is a typical command line to run with ImageNet data with batch size 128 per GPU:

python -m resnet_imagenet_main \
    --model_dir=/tmp/model_dir/something \
    --num_gpus=2 \
    --batch_size=128 \
    --train_epochs=90 \
    --train_steps=10 \
    --use_synthetic_data=false

See common.py for full list of options.

Using multiple GPUs

You can train these models on multiple GPUs using tf.distribute.Strategy API. You can read more about them in this guide.

In this example, we have made it easier to use is with just a command line flag --num_gpus. By default this flag is 1 if TensorFlow is compiled with CUDA, and 0 otherwise.

  • --num_gpus=0: Uses tf.distribute.OneDeviceStrategy with CPU as the device.
  • --num_gpus=1: Uses tf.distribute.OneDeviceStrategy with GPU as the device.
  • --num_gpus=2+: Uses tf.distribute.MirroredStrategy to run synchronous distributed training across the GPUs.

If you wish to run without tf.distribute.Strategy, you can do so by setting --distribution_strategy=off.

Running on Cloud TPUs

Note: This model will not work with TPUs on Colab.

You can train the ResNet CTL model on Cloud TPUs using tf.distribute.TPUStrategy. If you are not familiar with Cloud TPUs, it is strongly recommended that you go through the quickstart to learn how to create a TPU and GCE VM.

To run ResNet model on a TPU, you must set --distribution_strategy=tpu and --tpu=$TPU_NAME, where $TPU_NAME the name of your TPU in the Cloud Console. From a GCE VM, you can run the following command to train ResNet for one epoch on a v2-8 or v3-8 TPU:

python resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=1 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true

To train the ResNet to convergence, run it for 90 epochs:

python resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=90 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true

Note: $MODEL_DIR and $DATA_DIR must be GCS paths.

MNIST

To download the data and run the MNIST sample model locally for the first time, run one of the following command:

python mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download

To train the model on a Cloud TPU, run the following command:

python mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download

Note: the --download flag is only required the first time you run the model.

You can’t perform that action at this time.