TensorFlow 60-80% slower than PyTorch on training Wide ResNet #9322

Closed
taion opened this Issue Apr 20, 2017 · 39 comments

@taion
Contributor

taion commented Apr 20, 2017

cc @tfboyd

From #7187 (comment)

On an AWS p2.xlarge, using the tensorflow/tensorflow:1.0.1-devel-gpu Docker image as a base, I see ~270 ms per batch while training a WRN-16-4 without dropout on CIFAR-10.

Using a PyTorch implementation from https://github.com/xternalz/WideResNet-pytorch, I see instead ~150 ms per batch for the same.

My implementation of Wide ResNets uses NCHW and fused batch norm. It does use feed_dict for data loading, but I've observed with nvidia-smi that my GPU utilization stays near 100%.


To reproduce:

  • Clone https://github.com/4Catalyzer/dl-papers and build the Docker image:
$ docker build -t dl-papers .
  • Run the Docker image using NVIDIA Docker:
$ nvidia-docker run --rm -it dl-papers /bin/bash
  • Run the TF WRN-16-4 training:
# python -m dl_papers.wide_resnet.train cifar10
  • Observe the logged batch timings, then kill the process.
  • In the same Docker container, set up the PyTorch Wide ResNet example:
# cd ..
# pip install http://download.pytorch.org/whl/cu80/torch-0.1.11.post5-cp27-none-linux_x86_64.whl
# pip install torchvision tensorboard_logger
# git clone https://github.com/xternalz/WideResNet-pytorch.git
# cd WideResNet-pytorch
  • Run PyTorch training:
# python train.py --dataset cifar10 --layers 16 --widen-factor 4 -p 1
  • Observe logged batch timings.
@taion

Contributor

taion commented Apr 20, 2017

To rule out issues with my use of feed_dict, I've added a benchmark-constant branch to my repo that uses constant tensors for the batch inputs and outputs, and removes all the actual data loading logic.

On this branch, on the same AWS instance, I see batch times of ~246 ms, which is still over 60% slower than what I see with the PyTorch example (and this is non-apples-to-apples, since the PyTorch example is doing real data loading).
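
Roughly, the branch does the equivalent of the following (a sketch for illustration, not the exact branch code):

import numpy as np
import tensorflow as tf

BATCH_SIZE = 128  # assumed batch size for the sketch

# Fixed in-graph tensors stand in for the feed_dict placeholders, so the timing
# measures pure model compute with no host-to-device input transfer at all.
images = tf.constant(np.random.rand(BATCH_SIZE, 3, 32, 32).astype(np.float32))  # NCHW
labels = tf.constant(np.random.randint(0, 10, size=BATCH_SIZE).astype(np.int32))

# The model graph is then built on images/labels exactly as before, just without
# any placeholders or data loading (build_model here is hypothetical):
# logits = build_model(images, training=True)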

@tfboyd

Member

tfboyd commented Apr 20, 2017

I doubt I can be helpful on this one. I am having a hard time following all of the code references, e.g. wrappers around everything. I am not doubting it is correct, but after about 30 minutes of scanning through the code I know it is beyond me to follow. This is a shot in the dark, but if you are trying things to see if they impact the time, I would remove feed_dict completely. You can decrease/decay the learning rate (even on a set schedule) via the optimizer combined with global_step, which is the way I normally see it done; a rough sketch is below. I am not saying that will fix your issue; it is just the only thing that sticks out to me. Sorry. Maybe someone else will take a look. I know we are working on a wide & deep model to open source, but it may not be your exact one. I will try to check on that; it might also give me an idea of what might be causing your problems. Thank you for sharing.
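
Something like this is what I mean, as a rough sketch (the rate and schedule here are made up, and the loss is a stand-in for your model's loss):

import tensorflow as tf

# Stand-in loss so the snippet runs; your model's real loss goes here instead.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

global_step = tf.Variable(0, trainable=False, name='global_step')

# Decay on a fixed schedule driven by global_step, instead of feeding a new
# learning rate through feed_dict every step.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,      # made-up starting rate
    global_step=global_step,
    decay_steps=10000,      # made-up schedule
    decay_rate=0.2,
    staircase=True)

optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
train_op = optimizer.minimize(loss, global_step=global_step)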

@taion

Contributor

taion commented Apr 20, 2017

Would it help if I inlined the non-TF layer helpers into a single module? As you can probably tell, we pulled this out from a larger internal repo, where things were organized to optimize code reuse.

If I take out feed_dict entirely, I get ~241 ms. I suspect this is largely from eliding the conditional checks on training, though.

@tfboyd

Member

tfboyd commented Apr 20, 2017

Yeah, that did not save much. :-( I understand this may sound like I am deflecting the question, but how do you know your model is exactly the same as the PyTorch version?

Removing the wrappers and inlining seems painful; I would hate to have you do it and then not get time to dig in. This might be a fun side project for me, in terms of checking out the timeline. I am also asking internally about a wide & deep model, as I would like to start from something where I have more experts to poke. I will try to find time to ask around. I sent an email, but I suspect I will need to ask a few people.

@taion

Contributor

taion commented Apr 20, 2017

I believe they're the same. The two models have the same number of parameters, and I can't see anything different on inspection.

Also, the performance delta I see here is broadly consistent with those reported for ResNet-50 and ResNet-56 in the paper linked in the OP on #7187.

@tfboyd

Member

tfboyd commented Apr 20, 2017

Two things to keep in mind, both common in most of the code from the HK benchmark:

  1. Put your input pipeline on the CPU. I fixed this for another, simpler CIFAR example and the person got more than a 5x improvement, from ~700 imgs/sec to ~3800 imgs/sec on a K80: tensorflow/models#1264
  2. Use queues and follow the best practices for them. feed_dict is just for messing around; it is very slow. Next week we should release code that is even faster than queues, but queues can still get a lot of performance. A rough sketch of both points is below.

I know some of the sample code and published model code did not follow those best practices.
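
Illustrative only (the record layout constants and queue sizes are assumptions, not code from any of our models):

import tensorflow as tf

def cifar10_input(filenames, batch_size=128):
    # Keep file reading, decoding, and batching on the CPU so the GPU only ever
    # sees ready-made batches.
    with tf.device('/cpu:0'):
        filename_queue = tf.train.string_input_producer(filenames)
        # CIFAR-10 binary format: 1 label byte + 32*32*3 image bytes per record.
        reader = tf.FixedLengthRecordReader(record_bytes=1 + 32 * 32 * 3)
        _, value = reader.read(filename_queue)
        record = tf.decode_raw(value, tf.uint8)
        label = tf.cast(record[0], tf.int32)
        image = tf.cast(tf.reshape(record[1:], [3, 32, 32]), tf.float32) / 255.0  # CHW
        # Queue-based shuffling/batching; background threads keep the queue full.
        images, labels = tf.train.shuffle_batch(
            [image, label], batch_size=batch_size,
            capacity=20000, min_after_dequeue=5000, num_threads=4)
    return images, labels

Then start the queue runners with tf.train.start_queue_runners(sess) before the training loop.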

@ppwwyyxx

Contributor

ppwwyyxx commented Apr 20, 2017

Since it's slow even with a constant fake input and 100% GPU util, it's probably not a problem with feed_dict or input, right?

Also I'm able to reproduce the 60% performance difference with my own TF code written fully independently from the above TF code. So it's unlikely to be a problem of model definition, unless both of us interpret the pytorch code in the wrong way.

Slow speed at high GPU util may be a result of using inefficient kernels. For example, by setting cudnn.benchmark=False (disabling the autotuner) I can make the above pytorch code 2x slower while still showing 100% GPU utilization -- maybe that's something to look at.
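
For reference, the autotuner switch in pytorch is just:

import torch.backends.cudnn as cudnn

# With benchmark=True, cuDNN times the available algorithms for each conv shape
# once and caches the fastest one; with False it falls back to default heuristics,
# which is what makes the code above ~2x slower while GPU util stays at 100%.
cudnn.benchmark = True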

@tfboyd

Member

tfboyd commented Apr 20, 2017

No idea, someone else will have to pick this up. I know very little and even less about this model. I am sure it will come back around when we focus on these types of models (and thus the ops and kernels they are using) in a couple weeks.

@taion

Contributor

taion commented Apr 20, 2017

To clarify:

  • Even when using feed_dict for the data, I'm seeing 95%-100% GPU utilization, so I'm fairly confident that I'm not starving my GPU (data augmentation etc. still happens in a separate thread).
  • When I remove all placeholders and just use constant fake training data (i.e. to remove any memory or transfer pressure from moving data around), I still see TF as ~60% slower than the PyTorch model that's doing actual fitting.
  • I intentionally wrote this Wide ResNet model from scratch. The ResNet models in tensorflow/models have a number of problems, from inefficiency in the non-TF-Slim versions (using NHWC and not using fused batch norm) to actually implementing the model incorrectly in the TF-Slim versions (not dropping biases on convolutions before batch norms and not using beta and gamma on the batch norm operations, plus the same inefficiencies as above). A sketch of the block structure I mean is below.
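
For concreteness, the kind of block I mean looks roughly like this (a simplified sketch, not the code in my repo; variable handling and the moving-average updates for inference are stripped down):

import tensorflow as tf

def bn_relu_conv(x, out_channels, kernel_size, stride, name, training=True):
    # Pre-activation block in NCHW: fused batch norm with beta and gamma, ReLU,
    # then a bias-free convolution.
    with tf.variable_scope(name):
        in_channels = x.get_shape().as_list()[1]  # NCHW: channels on axis 1
        beta = tf.get_variable('beta', [in_channels],
                               initializer=tf.zeros_initializer())
        gamma = tf.get_variable('gamma', [in_channels],
                                initializer=tf.ones_initializer())
        x, _, _ = tf.nn.fused_batch_norm(
            x, gamma, beta, data_format='NCHW', is_training=training)
        x = tf.nn.relu(x)
        w = tf.get_variable(
            'w', [kernel_size, kernel_size, in_channels, out_channels],
            initializer=tf.contrib.layers.variance_scaling_initializer())
        # No bias on the conv: batch norm's beta already provides the shift.
        return tf.nn.conv2d(x, w, strides=[1, 1, stride, stride],
                            padding='SAME', data_format='NCHW')
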
@tfboyd

Member

tfboyd commented Apr 20, 2017

I know this version also has issues, but at a glance, what if I comment this in (and use a factor of 4 instead of 10 for the widening) and comment the other filters out?

https://github.com/tensorflow/models/blob/master/resnet/resnet_model.py#L87

Yes, I could ask someone on the team, but asking you might save me time and get this moving.

Would that give me, in general, the same model we are discussing? It is much easier for me to "package" this up and get people to look at it than code with a lot of extra "wrappers". I can then get it cleaned up so the community can have a nice version to use. I am pretty sure the team is working on a version of this to go with Estimators, and it would be great to flush out any issues sooner rather than later.

I hope I was clear. I am not super experienced with the models. I know a few things and try to help with simple problems. I saw feed_dict and wanted to make sure you did not have a simple problem. The danger of trying to help (which I hope to stop doing in the future) is that I end up in a situation where I cannot help directly and people get unfriendly.

Thank you for helping me try to help you.

Edited: to add that I need to adjust the widening factor.

@tfboyd

Member

tfboyd commented Apr 20, 2017

Here are the results on a GTX 1080 (which I also use as my main display card):

Not apples-to-apples, due to TF using a constant input. I used git checkout benchmark-constant and ran the following commands on Ubuntu 14.04 (a custom internal Google build) with CUDA 8.0 and cuDNN 5.1, as one would expect:

  • python -m dl_papers.wide_resnet.train cifar10
  • python train.py --dataset cifar10 --layers 16 --widen-factor 4 -p 1

0.069s vs 0.055s. I am running TF 1.1rc2 compiled with CUDA, compute capability 6.1, and AVX (which I doubt matters in this instance).

TensorFlow with the constant
[2017-04-20 00:15:48,235] INFO in train: epoch 0, batch 176: 0.068s
[2017-04-20 00:15:48,303] INFO in train: epoch 0, batch 177: 0.068s
[2017-04-20 00:15:48,372] INFO in train: epoch 0, batch 178: 0.068s
[2017-04-20 00:15:48,440] INFO in train: epoch 0, batch 179: 0.068s
[2017-04-20 00:15:48,510] INFO in train: epoch 0, batch 180: 0.069s
[2017-04-20 00:15:48,581] INFO in train: epoch 0, batch 181: 0.071s
[2017-04-20 00:15:48,649] INFO in train: epoch 0, batch 182: 0.068s

PyTorch with real data
Epoch: [0][83/391] Time 0.055 (0.069) Loss 1.5680 (1.8102) Prec@1 42.188 (31.343)
Epoch: [0][84/391] Time 0.054 (0.069) Loss 1.6633 (1.8084) Prec@1 39.062 (31.434)
Epoch: [0][85/391] Time 0.053 (0.069) Loss 1.6345 (1.8064) Prec@1 39.844 (31.532)
Epoch: [0][86/391] Time 0.056 (0.069) Loss 1.4338 (1.8021) Prec@1 46.875 (31.708)
Epoch: [0][87/391] Time 0.056 (0.069) Loss 1.4923 (1.7986) Prec@1 43.750 (31.845)
Epoch: [0][88/391] Time 0.057 (0.068) Loss 1.6836 (1.7973) Prec@1 45.312 (31.996)

Toby

@tfboyd

Member

tfboyd commented Apr 20, 2017

If I turn on WINOGRAD (TF_ENABLE_WINOGRAD_NONFUSED=1), I get another small bump, still using the same branch.

export TF_ENABLE_WINOGRAD_NONFUSED=1;python -m dl_papers.wide_resnet.train cifar10

[2017-04-20 00:37:44,189] INFO in train: epoch 0, batch 192: 0.062s
[2017-04-20 00:37:44,251] INFO in train: epoch 0, batch 193: 0.062s
[2017-04-20 00:37:44,313] INFO in train: epoch 0, batch 194: 0.062s
[2017-04-20 00:37:44,375] INFO in train: epoch 0, batch 195: 0.062s
[2017-04-20 00:37:44,438] INFO in train: epoch 0, batch 196: 0.062s
[2017-04-20 00:37:44,499] INFO in train: epoch 0, batch 197: 0.062s
[2017-04-20 00:37:44,561] INFO in train: epoch 0, batch 198: 0.062s
[2017-04-20 00:37:44,623] INFO in train: epoch 0, batch 199: 0.062s
[2017-04-20 00:37:44,688] INFO in train: epoch 0, batch 200: 0.065s
[2017-04-20 00:37:44,750] INFO in train: epoch 0, batch 201: 0.062s

@tfboyd

Member

tfboyd commented Apr 20, 2017

On the K80 (AWS p2.xlarge): TF 1.1rc2+ (built from head earlier today), CUDA 8.0, AVX2, cuDNN 5.1, and compute capability 3.7.

WINOGRAD did not help much on K80.

export TF_ENABLE_WINOGRAD_NONFUSED=1;python -m dl_papers.wide_resnet.train cifar10
[2017-04-20 07:52:13,740] INFO in train: epoch 0, batch 91: 0.168s
[2017-04-20 07:52:13,909] INFO in train: epoch 0, batch 92: 0.169s
[2017-04-20 07:52:14,079] INFO in train: epoch 0, batch 93: 0.170s
[2017-04-20 07:52:14,248] INFO in train: epoch 0, batch 94: 0.169s
[2017-04-20 07:52:14,417] INFO in train: epoch 0, batch 95: 0.169s
[2017-04-20 07:52:14,587] INFO in train: epoch 0, batch 96: 0.169s
[2017-04-20 07:52:14,756] INFO in train: epoch 0, batch 97: 0.169s
[2017-04-20 07:52:14,925] INFO in train: epoch 0, batch 98: 0.169s
[2017-04-20 07:52:15,094] INFO in train: epoch 0, batch 99: 0.169s

without WINOGRAD
[2017-04-20 07:53:38,301] INFO in train: epoch 0, batch 40: 0.169s
[2017-04-20 07:53:38,470] INFO in train: epoch 0, batch 41: 0.168s
[2017-04-20 07:53:38,639] INFO in train: epoch 0, batch 42: 0.169s
[2017-04-20 07:53:38,808] INFO in train: epoch 0, batch 43: 0.169s
[2017-04-20 07:53:38,977] INFO in train: epoch 0, batch 44: 0.169s
[2017-04-20 07:53:39,147] INFO in train: epoch 0, batch 45: 0.169s
[2017-04-20 07:53:39,316] INFO in train: epoch 0, batch 46: 0.169s
[2017-04-20 07:53:39,485] INFO in train: epoch 0, batch 47: 0.169s
[2017-04-20 07:53:39,654] INFO in train: epoch 0, batch 48: 0.169s

EDIT: After further testing, it turns out I had exported the environment variable earlier and had not set it back to zero, hence the lack of difference between the two runs.

@tfboyd

Member

tfboyd commented Apr 20, 2017

And now with real data on the K80, with WINOGRAD on and the rest the same; WINOGRAD still did very little on the K80.

[2017-04-20 08:03:47,636] INFO in train: epoch 0, batch 383: 0.174s
98% | 49152/50000 [01:07<00:01, 739.85ex/s] [2017-04-20 08:03:47,810] INFO in train: epoch 0, batch 384: 0.173s
99% | 49280/50000 [01:07<00:00, 739.13ex/s] [2017-04-20 08:03:47,983] INFO in train: epoch 0, batch 385: 0.173s
99% | 49408/50000 [01:07<00:00, 739.33ex/s] [2017-04-20 08:03:48,156] INFO in train: epoch 0, batch 386: 0.173s
99% | 49536/50000 [01:07<00:00, 739.21ex/s] [2017-04-20 08:03:48,330] INFO in train: epoch 0, batch 387: 0.173s
99% | 49664/50000 [01:07<00:00, 738.45ex/s] [2017-04-20 08:03:48,504] INFO in train: epoch 0, batch 388: 0.173s
100% | 49792/50000 [01:07<00:00, 737.94ex/s] [2017-04-20 08:03:48,675] INFO in train: epoch 0, batch 389: 0.172s

I like your graphic with the progress bar. Kind of fun but not so much fun to cut and paste.

EDIT: After further testing, it turns out I had exported the environment variable earlier and had not set it back to zero, hence the lack of difference between the two runs.

@tfboyd

Member

tfboyd commented Apr 20, 2017

Let me know if I did something wrong. I am going to bed.

@ppwwyyxx

Contributor

ppwwyyxx commented Apr 20, 2017

Actually with TF_ENABLE_WINOGRAD_NONFUSED=1, I saw only 5% difference between pytorch and tensorflow.

My code: https://gist.github.com/ppwwyyxx/43c75cd5a949fc1617be55ded7506f00
It's meant to be an equivalent model to the train.py in PyTorch (run without other arguments).
I'm on a Tesla M40, TensorFlow nightly, CUDA 8.0, cuDNN 5.1.

@ppwwyyxx

Contributor

ppwwyyxx commented Apr 20, 2017

I managed to get a set of traces from pytorch and tensorflow about what convolution algorithm and shapes they use:
pytorch:

Input: 128 3 32 32 0 Weight: 16 3 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 1
Input: 128 16 32 32 0 Weight: 160 16 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 6
Input: 128 160 32 32 0 Weight: 160 160 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 6
Input: 128 16 32 32 0 Weight: 160 16 1 1 0 pad: 0 0 0 Stride: 1 1 0 -> Algo: 1
Input: 128 160 32 32 0 Weight: 320 160 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 1
Input: 128 320 16 16 0 Weight: 320 320 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 7
Input: 128 160 32 32 0 Weight: 320 160 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 1
Input: 128 320 16 16 0 Weight: 640 320 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 1
Input: 128 320 16 16 0 Weight: 640 320 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 1
Input: 128 640 8 8 0 Weight: 640 640 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 7

bd Input: 128 640 8 8 0 Weight: 640 640 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 5
bf Input: 128 640 8 8 0 Weight: 640 640 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 5

bd Input: 128 320 16 16 0 Weight: 640 320 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 0
bf Input: 128 320 16 16 0 Weight: 640 320 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 1

bd Input: 128 320 16 16 0 Weight: 640 320 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 0
bf Input: 128 320 16 16 0 Weight: 640 320 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 3

Input: 128 320 16 16 0 Weight: 320 320 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 5
Input: 128 320 16 16 0 Weight: 320 320 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 5

Input: 128 160 32 32 0 Weight: 320 160 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 0
Input: 128 160 32 32 0 Weight: 320 160 1 1 0 pad: 0 0 0 Stride: 2 2 0 -> Algo: 0

Input: 128 160 32 32 0 Weight: 320 160 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 0
Input: 128 160 32 32 0 Weight: 320 160 3 3 0 pad: 1 1 0 Stride: 2 2 0 -> Algo: 3

Input: 128 160 32 32 0 Weight: 160 160 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 4
Input: 128 160 32 32 0 Weight: 160 160 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 5

bd Input: 128 16 32 32 0 Weight: 160 16 1 1 0 pad: 0 0 0 Stride: 1 1 0 -> Algo: 1
bf Input: 128 16 32 32 0 Weight: 160 16 1 1 0 pad: 0 0 0 Stride: 1 1 0 -> Algo: 3

bd Input: 128 16 32 32 0 Weight: 160 16 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 4
bf Input: 128 16 32 32 0 Weight: 160 16 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 3

bf Input: 128 3 32 32 0 Weight: 16 3 3 3 0 pad: 1 1 0 Stride: 1 1 0 -> Algo: 0

TensorFlow:

Conv accepts: 128, 3, (32, 32), 16, (3, 3), (1, 1), (2, 2), 1, 0 -> (1, 0)
Conv accepts: 128, 16, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (6, 0)
Conv accepts: 128, 160, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (6, 0)
Conv accepts: 128, 16, (32, 32), 160, (1, 1), (1, 1), (0, 0), 1, 0 -> (1, 0)
Conv accepts: 128, 160, (33, 33), 320, (3, 3), (2, 2), (1, 1), 1, 0 -> (1, 0)
Conv accepts: 128, 320, (16, 16), 320, (3, 3), (1, 1), (2, 2), 1, 0 -> (7, 0)
Conv accepts: 128, 160, (32, 32), 320, (1, 1), (2, 2), (0, 0), 1, 0 -> (1, 0)
Conv accepts: 128, 320, (16, 16), 640, (1, 1), (2, 2), (0, 0), 1, 0 -> (1, 0)
Conv accepts: 128, 320, (17, 17), 640, (3, 3), (2, 2), (1, 1), 1, 0 -> (1, 0)
Conv accepts: 128, 640, (8, 8), 640, (3, 3), (1, 1), (2, 2), 1, 0 -> (7, 0)

ConvBwdData accepts: 128, 640, (8, 8), 640, (3, 3), (1, 1), (2, 2), 1, 0 -> (5, 0)
ConvBwdFilter accepts: 128, 640, (8, 8), 640, (3, 3), (1, 1), (2, 2), 1, 0 -> (5, 0)

ConvBwdData accepts: 128, 320, (17, 17), 640, (3, 3), (2, 2), (1, 1), 1, 0 -> (0, 0)
ConvBwdFilter accepts: 128, 320, (17, 17), 640, (3, 3), (2, 2), (1, 1), 1, 0 -> (1, 0)

ConvBwdData accepts: 128, 320, (16, 16), 640, (1, 1), (2, 2), (0, 0), 1, 0 -> (0, 0)
ConvBwdFilter accepts: 128, 320, (16, 16), 640, (1, 1), (2, 2), (0, 0), 1, 0 -> (3, 0)

ConvBwdData accepts: 128, 320, (16, 16), 320, (3, 3), (1, 1), (2, 2), 1, 0 -> (5, 0)
ConvBwdFilter accepts: 128, 320, (16, 16), 320, (3, 3), (1, 1), (2, 2), 1, 0 -> (5, 0)

ConvBwdData accepts: 128, 160, (32, 32), 320, (1, 1), (2, 2), (0, 0), 1, 0 -> (0, 0)
ConvBwdFilter accepts: 128, 160, (32, 32), 320, (1, 1), (2, 2), (0, 0), 1, 0 -> (0, 0)

ConvBwdData accepts: 128, 160, (33, 33), 320, (3, 3), (2, 2), (1, 1), 1, 0 -> (0, 0)
ConvBwdFilter accepts: 128, 160, (33, 33), 320, (3, 3), (2, 2), (1, 1), 1, 0 -> (3, 0)

ConvBwdData accepts: 128, 160, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (4, 0)
ConvBwdFilter accepts: 128, 160, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (5, 0)

ConvBwdData accepts: 128, 16, (32, 32), 160, (1, 1), (1, 1), (0, 0), 1, 0 -> (1, 0)
ConvBwdFilter accepts: 128, 16, (32, 32), 160, (1, 1), (1, 1), (0, 0), 1, 0 -> (3, 0)

ConvBwdData accepts: 128, 16, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (4, 0)
ConvBwdFilter accepts: 128, 16, (32, 32), 160, (3, 3), (1, 1), (2, 2), 1, 0 -> (3, 0)

ConvBwdFilter accepts: 128, 3, (32, 32), 16, (3, 3), (1, 1), (2, 2), 1, 0 -> (0, 0)
# batch size, input chan, (input shape), out chan, (kernel shape), (stride), (padding), _, _ -> algo

With TF_ENABLE_WINOGRAD_NONFUSED=1, TensorFlow always chooses the same algorithm as PyTorch. But the log shows that TensorFlow sometimes calls cuDNN with irregular input shapes when stride=2. Maybe that can cause a performance issue.
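
For the curious, the (33, 33) shapes match TF's "SAME" padding arithmetic for a 3x3, stride-2 conv on a 32x32 input: the total padding of 1 is split as 0 before / 1 after, and TF appears to pre-pad the tensor itself before calling cudnn. A quick check of the arithmetic:

import math

def same_padding(in_size, kernel, stride):
    # TF "SAME" padding: output = ceil(in / stride); any odd leftover padding
    # goes to the bottom/right.
    out_size = math.ceil(in_size / stride)
    total_pad = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_before = total_pad // 2
    pad_after = total_pad - pad_before
    return out_size, pad_before, pad_after

print(same_padding(32, 3, 2))  # (16, 0, 1): 32 + 0 + 1 = 33, matching the log above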

@tfboyd

Member

tfboyd commented Apr 20, 2017

When it comes to a 5% difference, if you are not doing 10-plus tests (stopping and starting the process), and then on multiple machines, it could easily just be an anomaly. Also, don't just look at every 10th step; you need to average it all together correctly. I am not doubting the 5%, and it could be real, but I have been excited about perf improvements before that turned out to be just the luck of the machine. I have also noticed that if you get a p2.8xlarge and just use 1 GPU, it is faster than the p2.xlarge. It could be the CPU difference; it could be a lot of things. Just interesting, and not really worth a deep dive.

@ppwwyyxx is this a bug? I am not sure I follow your sentence. Is TF calling the wrong thing with or without WINOGRAD? Is it wrong, and a problem? I got confused between your use of the word "always" and then "sometimes". Can you let me know which version of TF and what general hardware you were using when you were getting the 60%-80% difference, and the actual numbers, even if just estimates?

But the log shows that tensorflow sometimes calls cudnn with irregular input shapes and paddings when stride=2.

@tfboyd

Member

tfboyd commented Apr 20, 2017

If you are doing actual research, I would increase the batch size (in PyTorch as well) unless you think it will harm accuracy. I did not check memory usage, but if you want to train as fast as possible, that is sometimes overlooked.

Here are also some other flags we use in the benchmarks. They did not seem to have an impact on the K80, but I did not test extensively.

config = tf.ConfigProto()
config.allow_soft_placement = True
config.intra_op_parallelism_threads = 1
config.inter_op_parallelism_threads = 0  # let the system figure it out
config.gpu_options.force_gpu_compatible = True  # requires master head from the past few days

This env variable might also be of interest: os.environ['TF_AUTOTUNE_THRESHOLD']. It defaults to 1, if I understand the underlying code, so there is likely no need to set it, but it might be interesting.

@cancan101

Contributor

cancan101 commented Apr 20, 2017

I posted an issue before about TF logging the conv algorithm in use (pytorch has an option for this): #8941

This probably isn't a huge issue, but one issue is that TF will have to transpose the weights on each forward/backward pass: #8287 (and #7187 (comment)).

@Yangqing also had some comments on workspace size (TF_CUDNN_WORKSPACE_LIMIT_IN_MB?): #7187 (comment) that I don't fully understand.

@tfboyd

Member

tfboyd commented Apr 20, 2017

@cancan101 Thank you, Alex; those are some items I can ask about, and thank you for linking them in one chunk. I would 100% build TF 1.1 from source (not the branch, just build from master). I have not gone back to see when the algorithm picking got better (if that is what closed the gap), but my guess is toward the end of March. That is a wild guess, based on not understanding the code as I read the diff history.

@tfboyd tfboyd self-assigned this Apr 20, 2017

@cancan101

Contributor

cancan101 commented Apr 20, 2017

I would also add that the following env vars don't seem to be documented (example of setting them below):

  • TF_AUTOTUNE_THRESHOLD
  • TF_ENABLE_WINOGRAD_NONFUSED
  • TF_CUDNN_WORKSPACE_LIMIT_IN_MB
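
For anyone trying them, these are plain process environment variables, so something like this works (the values here are only examples):

import os

os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'        # enable the nonfused Winograd conv kernels
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'              # example value; reportedly defaults to 1
os.environ['TF_CUDNN_WORKSPACE_LIMIT_IN_MB'] = '1024'  # example cap on cuDNN scratch space

import tensorflow as tf  # set the variables before the ops that read them run

Setting them in the shell (export VAR=...) before launching works equally well.
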
@taion

Contributor

taion commented Apr 20, 2017

@tfboyd

Sorry about that progress bar. It works a lot better in actual use when we're not logging out timings after every batch. 😛

I'm working on using a modified version of the data pipeline from the tutorial (modified to use 32x32 and NCHW) to get a more apples-to-apples comparison. Though, is feed_dict expected to be slower even if GPU utilization is in the 95%-100% range?

I'll take a look at TF 1.1 RCs and master as well. We've been holding off on that upgrade because of some TensorBoard regressions, but we weren't aware that there was a potential performance boost.

@tfboyd

Member

tfboyd commented Apr 20, 2017

@taion I thought your progress bars were cool; they are kind of fun when logging that often. It was 1am when I ran it, and my terminal was like a black and white explosion of moving stuff. My first thought was that my scripts need to be much cooler.

Yeah, that TensorBoard thing got me as well. I don't want to talk about it. :-)

Apologies again. When I said apples-to-apples, I meant that in the first post I was running TF with a constant and PyTorch with real data. Turns out the results were similar. I honestly do not know if the queues will help over feed_dict with the GPU at 100%. I do know that every time I see it I assume it is a problem, and it is good practice to just not use the feed_dict approach. I hope I am being clear that I don't know in this instance; I would do it just so I knew it was not a problem. If I was faster with TF, I would "hack" it in and try it.

Thank you for all the feedback, Jimmy; you are a nice person. It is really hard to work through this stuff with "strangers" and try to get on the same page. You likely know way more than I do, but I have access to some code and happen to be connected to recent changes.

@cancan101 Yup. Once we are sure WINOGRAD is working in all instances it will just be the default. The AUTOTUNE variable is not really needed, but you could play with it. TF_CUDNN_WORKSPACE_LIMIT_IN_MB is not something I have used. I will include them in the perf guide update next week, and the ones we use in the benchmarks are embedded in the code. I shared them with this group because all of you were curious and seem knowledgeable enough (more than me in many ways) to turn them on or off if you have issues, e.g. things don't converge or things "explode".

This is great feedback. Yes, I type too much.

@taion

Contributor

taion commented Apr 20, 2017

The queue-based data loading from the tutorial appears to be comparable to what I had before with feed_dict.

Using the same setup as above (TF 1.0.1 from the Docker image on a p2.xlarge).

Queue (https://github.com/4Catalyzer/dl-papers/tree/benchmark-queue):

[2017-04-20 15:16:30,871] INFO in train: epoch 0, batch 200: 0.268s
[2017-04-20 15:16:31,141] INFO in train: epoch 0, batch 201: 0.269s
[2017-04-20 15:16:31,416] INFO in train: epoch 0, batch 202: 0.275s
[2017-04-20 15:16:31,691] INFO in train: epoch 0, batch 203: 0.274s
[2017-04-20 15:16:31,966] INFO in train: epoch 0, batch 204: 0.275s
[2017-04-20 15:16:32,239] INFO in train: epoch 0, batch 205: 0.273s
[2017-04-20 15:16:32,513] INFO in train: epoch 0, batch 206: 0.274s
[2017-04-20 15:16:32,791] INFO in train: epoch 0, batch 207: 0.277s
[2017-04-20 15:16:33,066] INFO in train: epoch 0, batch 208: 0.275s
[2017-04-20 15:16:33,335] INFO in train: epoch 0, batch 209: 0.269s

feed_dict (https://github.com/4Catalyzer/dl-papers/tree/benchmark-feed-data-only):

[2017-04-20 15:18:00,192] INFO in train: epoch 0, batch 200: 0.275s
[2017-04-20 15:18:00,470] INFO in train: epoch 0, batch 201: 0.278s
[2017-04-20 15:18:00,751] INFO in train: epoch 0, batch 202: 0.281s
[2017-04-20 15:18:01,029] INFO in train: epoch 0, batch 203: 0.277s
[2017-04-20 15:18:01,309] INFO in train: epoch 0, batch 204: 0.280s
[2017-04-20 15:18:01,583] INFO in train: epoch 0, batch 205: 0.274s
[2017-04-20 15:18:01,859] INFO in train: epoch 0, batch 206: 0.276s
[2017-04-20 15:18:02,138] INFO in train: epoch 0, batch 207: 0.279s
[2017-04-20 15:18:02,414] INFO in train: epoch 0, batch 208: 0.275s
[2017-04-20 15:18:02,690] INFO in train: epoch 0, batch 209: 0.276s

This doesn't say anything about TF vs. PyTorch, but it's consistent with what we saw earlier: there isn't a huge performance gap between feed_dict and queues here.

@taion

Contributor

taion commented Apr 20, 2017

Turns out I don't need to re-build TF after all. In fact it looks like all I need to do is to set TF_ENABLE_WINOGRAD_NONFUSED=1.

I get comparable numbers to what you report above when using the 1.0.1 Docker image, and when installing 1.1.0rc2 from pip.

Against https://github.com/4Catalyzer/dl-papers/tree/benchmark-queue as above:

[2017-04-20 15:25:47,770] INFO in train: epoch 0, batch 200: 0.162s
[2017-04-20 15:25:47,934] INFO in train: epoch 0, batch 201: 0.164s
[2017-04-20 15:25:48,100] INFO in train: epoch 0, batch 202: 0.165s
[2017-04-20 15:25:48,266] INFO in train: epoch 0, batch 203: 0.166s
[2017-04-20 15:25:48,431] INFO in train: epoch 0, batch 204: 0.164s
[2017-04-20 15:25:48,594] INFO in train: epoch 0, batch 205: 0.164s
[2017-04-20 15:25:48,761] INFO in train: epoch 0, batch 206: 0.167s
[2017-04-20 15:25:48,926] INFO in train: epoch 0, batch 207: 0.164s
[2017-04-20 15:25:49,089] INFO in train: epoch 0, batch 208: 0.163s
[2017-04-20 15:25:49,254] INFO in train: epoch 0, batch 209: 0.165s

I can see the comment in the source code regarding testing, but this seems to make a significant difference in performance when using common model types that heavily use 3x3 convolutions.

How can we get this added to the TF documentation?
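For readers who want to try this: the flag is an ordinary environment variable, so exporting TF_ENABLE_WINOGRAD_NONFUSED=1 in the shell before launching training is enough. Below is a minimal Python alternative (useful in notebooks or launcher scripts); it assumes the flag is read lazily when cuDNN convolution algorithms are selected, so setting it before the TensorFlow import should be equivalent to exporting it in the shell.

import os

# Set the flag before TensorFlow picks cuDNN convolution algorithms;
# doing it before the tensorflow import is the safest point (assumption).
os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "1"

import tensorflow as tf  # noqa: E402

# ... build the graph and train as usual ...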

@cancan101

Contributor

cancan101 commented Apr 20, 2017

@tfboyd is there an open ticket tracking the progress on this (i.e. making the following the default):

Once we are sure WINOGRAD is working in all instances it will just be the default.

I created: #9339

@ppwwyyxx

Contributor

ppwwyyxx commented Apr 20, 2017

@tfboyd Currently I don't have other machines to test on, but I ran both several times. TF is always slower, but within 5%: 1.94 it/s for TF and around 1.99 it/s for PyTorch (without extra arguments to change the widen factor, etc.). Before enabling the NONFUSED algorithm (cuDNN algo No. 7, as you can see in the log), TF is at about 1.25 it/s.
I'm using a Tesla M40, an E5-2680v3, a TensorFlow nightly downloaded yesterday, CUDA 8.0, and cuDNN 5.1. PyTorch also uses cuDNN 5.1.

I modified my TF code to use constant input and the results are the same. I'm looking at the 10-step average time, but the number is quite stable after the warm-up.

To clarify a bit, my log shows that:

  1. Although TF and PyTorch use different methods to choose the cuDNN algorithm, they chose the same algorithm for all the convolutions in this model.
  2. But for some convolutions they invoke cuDNN with different shapes. I think that's because TF's padding convention differs from others'.
    According to the docs, for a 32x32 conv with kernel 3, stride 2, and SAME padding, TF will pad 0 at left/top and 1 at right/bottom. If I'm not mistaken, cuDNN pads the opposite way. This explains this line from the TF VLOG:
Conv accepts: 128, 160, (33, 33), 320, (3, 3), (2, 2), (1, 1), 1, 0 -> (1, 0)

where TF pre-pads the image to (33, 33) and then calls cuDNN (code here). I'm wondering whether that could be a performance issue, because 33 doesn't look like a good number for the cache.
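To make the padding arithmetic above concrete, here is a minimal sketch of the SAME-padding rule as documented for TensorFlow (the helper name and the example shapes are illustrative only; the point is that any odd leftover padding goes to the bottom/right):

def same_pad_1d(in_size, kernel, stride):
    # Output size under SAME padding: ceil(in_size / stride).
    out_size = -(-in_size // stride)
    # Total padding needed so the last window still fits.
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    # The smaller half goes before (top/left), the larger half after (bottom/right).
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    return pad_before, pad_after

# 32x32 input, 3x3 kernel, stride 2 -> (0, 1): pad 0 at top/left and 1 at
# bottom/right, so the convolution effectively sees a 33x33 image.
print(same_pad_1d(32, 3, 2))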

@tfboyd

Member

tfboyd commented Apr 20, 2017

@ppwwyyxx That is good data. I am not super worried about 5%, although I would like to close the gap.
I will try to follow up on this and get the items mentioned above documented.

@taion

Contributor

taion commented Apr 20, 2017

I'd like to highlight that I think the biggest gap here is in developer experience (DX). We started migrating from Theano to TensorFlow the first week of this year. This is partly my frustration speaking, but the process looked something like this:

  1. Write naive model implementations, which end up using NHWC and unfused batch norm.
  2. See documentation that suggests NCHW is faster for convolutions (this predates the perf guide). Refactor models to support both NCHW and NHWC, because we still need to support CPU inference. Observe that the models run ~6x slower. Give up on this for a bit.
  3. See documentation that fused batch norm is faster. Switch back from tf.layers.batch_normalization to tf.contrib.layers.batch_norm. This shows a performance improvement.
  4. Try NCHW again, and now see a performance improvement.
    a. Realize earlier issue was due to #7551
  5. See the performance guide. Convert input pipeline to use queues instead of feed_dict. See negligible speedup.
  6. Compare to straightforward PyTorch impl, and observe that the PyTorch impl runs much faster.
  7. Learn through this issue that we need to enable an undocumented environment flag to access the fastest cuDNN mode for 3x3 convolutions.

As a developer, this is a really suboptimal experience.

My impression is that the typical TF example or published code is at something like our step (1) above, in that it uses NHWC, uses unfused batch norm, and doesn't enable the non-fused Winograd convolution. Correspondingly, performance is quite far from optimal.

By contrast, though with a smaller sample size, the PyTorch examples I've seen generally seem to do "the right thing" performance-wise, and seem to run quickly out-of-the-box. (Also, the status of the built-in PyTorch layer API makes separate PyTorch examples far more consistent in terms of how the code reads.)

I'm very grateful for your help in tracking down these issues, but I really wish the out-of-the-box experience were better, and that it didn't take so much work to get to this point.
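As a concrete reference for steps 2-3 of the list above, here is a minimal sketch (assuming TF 1.x, as used in this thread) of a conv block configured with NCHW and fused batch norm. The helper name, filter count, and placeholder shape are illustrative and are not taken from the linked repo.

import tensorflow as tf

def conv_bn_relu(x, filters, is_training, data_format="NCHW"):
    # Hypothetical helper: 3x3 conv + fused batch norm + ReLU in NCHW layout.
    x = tf.layers.conv2d(
        x, filters, 3,
        padding="same",
        use_bias=False,
        data_format="channels_first" if data_format == "NCHW" else "channels_last",
    )
    # tf.contrib.layers.batch_norm accepts both fused=True and data_format="NCHW".
    x = tf.contrib.layers.batch_norm(
        x, fused=True, data_format=data_format, is_training=is_training,
    )
    return tf.nn.relu(x)

# NCHW: (batch, channels, height, width)
images = tf.placeholder(tf.float32, [None, 3, 32, 32])
net = conv_bn_relu(images, 64, is_training=True)

Combined with TF_ENABLE_WINOGRAD_NONFUSED=1 in the environment, this is roughly the configuration the timings above were measured with.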

@taion taion changed the title from TensorFlow ~80% slower than PyTorch on training Wide ResNet to TensorFlow 60-80% slower than PyTorch on training Wide ResNet Apr 20, 2017

@tfboyd

Member

tfboyd commented Apr 20, 2017

I feel your pain.

On the plus side, TF 1.1 gives you very similar performance to the "secret flag" that I should not have shared. One step at a time. I do not want to argue your points, as I do understand the frustration. On the feed_dict issue: even if removing it was not helpful in this instance, it is not normally good practice for scaling. Thank you for the feedback. I consider this item closed, and I will try to follow up on the requests and items in the various comments.

@tfboyd tfboyd closed this Apr 20, 2017

@taion

Contributor

taion commented Apr 20, 2017

@tfboyd

Can you double-check the timings you got against master without TF_ENABLE_WINOGRAD_NONFUSED=1?

I just built TF by hand against master, with compute capability 3.7 and all the CPU optimizations enabled, and without setting TF_ENABLE_WINOGRAD_NONFUSED=1, I still see:

[2017-04-20 20:27:27,056] INFO in train: epoch 0, batch 200: 0.284s
[2017-04-20 20:27:27,340] INFO in train: epoch 0, batch 201: 0.284s
[2017-04-20 20:27:27,616] INFO in train: epoch 0, batch 202: 0.276s
[2017-04-20 20:27:27,898] INFO in train: epoch 0, batch 203: 0.282s
[2017-04-20 20:27:28,175] INFO in train: epoch 0, batch 204: 0.277s

With TF_ENABLE_WINOGRAD_NONFUSED=1, I see as expected:

[2017-04-20 20:29:37,525] INFO in train: epoch 0, batch 200: 0.170s
[2017-04-20 20:29:37,695] INFO in train: epoch 0, batch 201: 0.169s
[2017-04-20 20:29:37,861] INFO in train: epoch 0, batch 202: 0.166s
[2017-04-20 20:29:38,028] INFO in train: epoch 0, batch 203: 0.166s
[2017-04-20 20:29:38,195] INFO in train: epoch 0, batch 204: 0.167s

Naively this makes sense to me. The non-fused Winograd operation should still be the fastest for 3x3 convolutions, so it'd be odd to see a performance gain without enabling it.

@tfboyd

Member

tfboyd commented Apr 20, 2017

Yeah, I will. It was late at night, and it is possible I exported the variable and did not clear it. I will include the SHA hash as well.

@tfboyd

Member

tfboyd commented Apr 21, 2017

@taion Hey, I want to do a short write-up for tensorflow-discuss covering your model and the flags we used to improve performance. I know you are frustrated by the churn, and these flags are far from your or my ideal of "awesome". That said and understood, I want to give others these tips so they can get improved performance while we continue to bring the feature into the core. I did not want to reference your GitHub repo without your permission, and I would also like to reference your GitHub username, if that is OK with you. Obviously I still need to work out whether WINOGRAD is needed as a flag in TF 1.1. I ran out of time today; I need to spin up my AWS instance (from an image) again and pull down the repos.

Edited for commas.

@taion

Contributor

taion commented Apr 21, 2017

Thanks again for your help.

Feel free to reference our repo. We'd been meaning to open source some of our paper repros anyway, since there didn't seem to be many TF examples in the wild that supported flexible data_format and used fused batch norm.

I personally don't mind churn at all; if anything I'd be quite happy to see e.g. defaults updated so that the fast way and the default way were the same.

@taion

Contributor

taion commented Apr 21, 2017

Feel free to reference me and/or the repo, I mean.

@tfboyd

Member

tfboyd commented Apr 21, 2017

Cool and thank you.

@tfboyd

Member

tfboyd commented May 8, 2017

@taion I am finally back to doing a write-up, and I hope to send it this week, but I suspect that means next week. Also, I confirmed what you saw: regardless of version, even at the latest head of master, you need the WINOGRAD flag on for the boost. I am sure we have an internal bug to make it on by default (or on when it makes sense), but I am going to start obnoxiously asking about it.

@taion

Contributor

taion commented May 8, 2017

Thanks, looking forward to it. We've since done some of that benchmarking on Pascal GPUs and verified that enabling non-fused Winograd wasn't as big a deal here, but it'd be great to see that massive speedup on 3x3 convolutions by default on Kepler. 3x3 convolutions are extremely common, and most cloud vendors are still only providing Kepler hardware.
