Standalone graph trained using NCHW data_format giving errors on CPU when testing (running forward pass) #13146

Closed
tumusudheer opened this Issue Sep 19, 2017 · 22 comments

tumusudheer commented Sep 19, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 14.04
  • TensorFlow installed from (source or binary): Binary (.whl file)
  • TensorFlow version (use command below): 1.2.0-rc1
  • Python version: 2.7
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version:
    CUDA 8.0.61
    cuDNN 5.1.10
  • GPU model and memory: GeForce GTX 1080
  • Exact command to reproduce:

Describe the problem

I have a standalone graph created by freezing my model/net, which has a few slim.conv2d and slim.max_pool2d operations defined with the NCHW data format. I used the freeze_graph.py utility to create the standalone graph from a checkpoint trained in NCHW format. When I tested the graph (ran a forward pass) on an Ubuntu machine with a GPU, it ran very well. But when I ran the same test (forward pass) with the same standalone graph on an Ubuntu machine that has only a CPU and no GPU, I got the following errors:

2017-09-18 17:57:48.890761: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC.
         [[Node: text_box_300/pool1/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 2, 2], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/cpu:0"](text_box_300/conv1/conv1_2/Relu)]]
2017-09-18 17:57:48.892768: E main_textd_test.cc:457] Running model failed: Invalid argument: Default MaxPoolingOp only supports NHWC.
         [[Node: text_box_300/pool1/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 2, 2], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/cpu:0"](text_box_300/conv1/conv1_2/Relu)]]

I believe the errors are related to #2660.
Is it possible to modify the standalone graph that runs well on GPU so that it also runs successfully on a CPU machine? Is it possible to avoid retraining the model in NHWC format and recreating the standalone model?

aselle commented Sep 21, 2017

@petewarden, is there any easy way to handle this? In principle it should be possible to rewrite the graph and transpose all the weight variables so that it can operate in NHWC. Perhaps Grappler could help?

petewarden commented Sep 26, 2017

We don't currently have a way to handle this. In theory it should be possible to write a Graph Transform Tool rule to do the swizzling, or maybe do it in Grappler, but we'd need to maintain a list of all the ops this applies to. We don't have any plans to work on this in the near term, unfortunately, but PRs would be welcome.

tumusudheer commented Sep 26, 2017

Hi @petewarden,

Thank you very much for your response. Is this the correct Graph Transform Tool you are referring to? Could you briefly outline the steps involved in writing a rule to do the graph swizzling with that tool (or with Grappler, whichever is easier in your opinion), so I can work on it myself?

Also, there is no documentation on what Grappler is or how to use it (at least I didn't find any). I assume this is the Grappler tool you are referring to; correct me if I'm wrong. Can you point me to any documentation on Grappler's purpose and how to use it?

Thank you.

aselle commented Sep 26, 2017

@zhangyaobit, this may be a major issue if we enable Grappler by default and there is no way to rewrite the checkpoints and graph to be NCHW.

tumusudheer commented Sep 26, 2017

Hi @aselle,

I agree that it is a major issue if there is no way to rewrite graphs and checkpoints from NCHW to NHWC and vice versa. The TensorFlow (v1.2) official documentation recommends training in NCHW and doing inference in NHWC here:

The best practice is to build models that work with both NCHW and NHWC as it is common to train using NCHW on GPU, and then do inference with NHWC on CPU.

zhangyaobit commented Sep 27, 2017

@tumusudheer, note that "MaxPoolingOp only supports NHWC" on the CPU, so could you use NHWC (as opposed to NCHW) when running on CPU?

I think the checkpoint should be compatible (you can train in NCHW and infer in NHWC), because weights/filter inputs are always stored in the same format, [filter_height, filter_width, in_channels, out_channels], regardless of whether NCHW or NHWC is used: https://www.tensorflow.org/api_docs/python/tf/nn/conv2d

(For your problem, there seems to be no need for the Graph Transform Tool or Grappler; you just need to set the format to NHWC when inferring on CPU.)
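
For illustration, a minimal sketch (my own, not from the thread) of why the checkpoint is format-independent: the filter variable keeps the same HWIO shape under both layouts, and only the data tensor layout changes:

import tensorflow as tf

# Filter is always [filter_height, filter_width, in_channels, out_channels],
# no matter which data_format the graph uses.
filters = tf.get_variable('conv_w', shape=[3, 3, 64, 128])

x_nhwc = tf.placeholder(tf.float32, [None, 32, 32, 64])   # N, H, W, C
y_nhwc = tf.nn.conv2d(x_nhwc, filters, strides=[1, 1, 1, 1],
                      padding='SAME', data_format='NHWC')

x_nchw = tf.placeholder(tf.float32, [None, 64, 32, 32])   # N, C, H, W
y_nchw = tf.nn.conv2d(x_nchw, filters, strides=[1, 1, 1, 1],
                      padding='SAME', data_format='NCHW')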

tumusudheer commented Sep 27, 2017

Hi @zhangyaobit,

Thank you very much for your response. For running inference on CPU, I created my graph so that all operations (conv2d and MaxPool) take NHWC data instead of NCHW. But when I restore the checkpoint (which I obtained by training the graph in NCHW format on GPUs), I get the following errors:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [28] rhs shape= [3]
         [[Node: save/Assign_16 = Assign[T=DT_FLOAT, _class=["loc:@text_box_300/conv10_box/conv_cls/BatchNorm/moving_mean"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](text_box_300/conv10_box/conv_cls/BatchNorm/moving_mean, save/RestoreV2_16/_23)]]

I've attached the complete error log in case you need it (I've attached the code as well).

create_stand_alone.py.txt

These are the steps I followed to create the standalone graph/model to run inference on CPUs:

  1. Create the graph with all ops using NHWC
  2. Restore the checkpoint weights (trained on GPU using the graph with NCHW ops)
  3. Save the final graph as a .pb file

I'm working with this model and was able to run inference successfully on GPUs, but not on CPUs, because of the NHWC data format issue:

error.txt

It seems the cause of the error is that TensorFlow is not able to assign the weights restored from the NCHW graph checkpoint to the NHWC graph variables. But if the checkpoint stores weights in the same format irrespective of data_format (as per your previous post), I should not get a dimension-mismatch error while assigning weights to graph variables. Am I missing something or doing anything wrong? Please let me know if I'm making a basic mistake. I can also provide my checkpoint files in case you need them.

Thanks in advance.
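
For reference, a hedged sketch of those three steps (build_textbox_net, the paths, and the output node name are hypothetical placeholders, not the attached create_stand_alone.py.txt code):

import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Graph().as_default() as graph:
    build_textbox_net(data_format='NHWC')  # hypothetical NHWC model builder
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Restore the weights from the NCHW-trained checkpoint (placeholder path).
        saver.restore(sess, '/path/to/nchw_trained.ckpt')
        # Fold variables into constants and write the standalone .pb file.
        frozen = graph_util.convert_variables_to_constants(
            sess, graph.as_graph_def(), ['output_node'])  # placeholder output name
        with tf.gfile.GFile('standalone_nhwc.pb', 'wb') as f:
            f.write(frozen.SerializeToString())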

zhangyaobit commented Sep 27, 2017

What should the correct shape of text_box_300/conv10_box/conv_cls/BatchNorm/moving_mean be? I think it should be a 1-D vector whose size is the number of channels. Could you check whether the number of channels should be 28 or 3?

You can use this tool to inspect the checkpoint file and find out what shape moving_mean has: https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/python/tools/inspect_checkpoint.py

It looks like the shape of moving_mean in the model is inconsistent with the one in the checkpoint, and you need to make some changes accordingly. You will need to debug a bit: you can add "pdb.set_trace()" inside batch_norm and check the shape of moving_mean to see how that shape is derived and what change would make it consistent with the checkpoint.
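
The same inspection can also be done programmatically (a sketch with a placeholder checkpoint path; NewCheckpointReader is standard TF 1.x API):

from __future__ import print_function
from tensorflow.python import pywrap_tensorflow

reader = pywrap_tensorflow.NewCheckpointReader('/path/to/model.ckpt')  # placeholder
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    if 'moving_mean' in name:
        print(name, shape)  # compare against the shapes the NHWC graph expects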

tumusudheer commented Sep 27, 2017

Hi @zhangyaobit,

Here is the output of the inspect_checkpoint for all tensors:
inspect_output.txt

For tensor_name: text_box_300/conv10_box/conv_cls/BatchNorm/moving_mean
[ 0.0456978 0.01166576 0.01197682]
The shape is 3.
Similarly, the shape of text_box_300/conv7/BatchNorm/moving_mean is 19.

But the checkpoint is the result of a graph trained on GPU with operations taking data in NCHW format. When I create the graph to run inference on CPU (with NHWC data inputs) and restore from the GPU checkpoint, I get the above errors.
Are you saying that when I create the graph for CPU inference, the dimensions are inconsistent with the checkpoint dimensions? Shouldn't they be different, since the checkpoints come from a graph built with NCHW operations and I'm restoring into a graph built with NHWC operations?

As per your suggestion to make changes: do I need to change the tensors while creating the CPU-inference graph that takes the NHWC data format, or after creating the graph, while restoring the weights?

Thank you very much for taking your time to assist me with this issue.

zhangyaobit commented Sep 27, 2017

Sure, my pleasure.

I could be mistaken, but my understanding is that using NCHW or NHWC doesn't affect the format of the variables, so the checkpoint should be the same regardless of which is used (you can verify this by comparing checkpoints trained with NCHW and with NHWC respectively).

And when you restore the checkpoint, you load exactly the same variables regardless of whether NCHW or NHWC is used. (Note that NCHW is the format of the data input, not the filter input of the conv; the filter input is a variable and is always stored in the same format, [filter_height, filter_width, in_channels, out_channels], regardless of data format, which is why the checkpoint is not affected.) However, when you restore the checkpoint, I think somewhere in your model you need to set the right shape for moving_mean (in addition to setting the format to NHWC).

Could you add a breakpoint (pdb.set_trace()) in batch_norm and examine why the shape of moving_mean is 28 instead of 3 for text_box_300/conv10_box/conv_cls/BatchNorm/moving_mean?

tumusudheer commented Sep 27, 2017

Hi @zhangyaobit,

I guess you want me to add the line import ipdb; ipdb.set_trace() in the batch_normalization function? Where exactly should I add it to debug batch_norm? I'm using tf.contrib.slim.conv2d layers with batch_norm_params.

Here is my conv2d function:

def conv2d(inputs, out, kernel_size, scope, stride=1, activation_fn=tf.nn.relu,
           padding='SAME', use_batch=False, batch_norm_params={}, rate=1):
    if use_batch:
        # Conv + batch norm: batch_norm_params is forwarded to slim.batch_norm.
        net = slim.conv2d(inputs, out, kernel_size, stride=stride, scope=scope,
                          normalizer_fn=slim.batch_norm,
                          normalizer_params=batch_norm_params,
                          activation_fn=activation_fn, padding=padding, rate=rate)
    else:
        net = slim.conv2d(inputs, out, kernel_size, stride=stride, scope=scope,
                          activation_fn=activation_fn, padding=padding, rate=rate)
    return net

As the stack trace of my error refers to TensorFlow code at /usr/local/lib/python2.7/dist-packages/tensorflow/python/, I edited /usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/normalization.py and added the line import ipdb; ipdb.set_trace() at the end of the def build(self, input_shape): function, but execution does not stop at the breakpoint and I don't see any stack trace.

Can you tell me where I should add the pdb.set_trace() line to debug the code?

Thank you.

zhangyaobit commented Sep 28, 2017

(1)
You can add it at this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L311

params_shape is moving_mean's shape: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L366

And params_shape shouldn't be affected by data_format: note that the logic below line 311 always sets the size of params_shape to the number of channels.

(2)
Could you also check the shape of the input? Add pdb.set_trace() at this line: https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/nets/txtbox_300.py#L209

The input's shape should be NCHW if data_format is specified as NCHW, and NHWC if data_format is NHWC. From a quick look at the code, it might be missing a transpose that correctly sets the shape of the input. Could you double-check that?

You can transpose between NHWC and NCHW using the code below:

y = tf.transpose(x, [0, 3, 1, 2])  # NHWC to NCHW
y = tf.transpose(x, [0, 2, 3, 1])  # NCHW to NHWC

tumusudheer commented Sep 30, 2017

Hi @zhangyaobit,

Sorry for my delay, and thank you very much for going through the TextBoxes code; I really appreciate your help.

  1. I added import pdb; pdb.set_trace() here
    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L311

Here are the findings:

(Pdb) params_shape
TensorShape([Dimension(1024)])
(Pdb) inputs_shape
TensorShape([Dimension(32), Dimension(19), Dimension(19), Dimension(1024)])

params_shape is not the number of channels; instead it was 1024, which is obviously wrong. Somewhere along the line in the model, the data (shapes) is not being transposed/handled correctly for NHWC.

  2. Also added pdb.set_trace() here:
    https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/nets/txtbox_300.py#L209
> /home/sudheer/Flipkart/Research/maneesh/tensorflow_1.2/models/TextBoxes-TensorFlow_2.0/nets/txtbox_300.py(210)text_net()
-> end_points = {}
(Pdb) inputs
<tf.Tensor 'fifo_queue_Dequeue:0' shape=(32, 300, 300, 3) dtype=float32>

The shape of the input tensor is being set correctly. That is because I'm passing the extra data_format param to the preprocessing function here:
https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/load_batch.py#L44

The code should be:
txt_preprocessing.preprocess_image(image, glabels, gbboxes, height, width, out_shape,use_whiten=FLAGS.use_whiten,is_training=is_training,data_format=FLAGS.data_format)

I guess I need to start debugging the model code to find out where the dimensions are being set incorrectly.

Added:

Just found one mistake in the model: no matter what my input data_format is, the data_format is always NHWC here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/layers.py#L312

I guess in the code here https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/nets/txtbox_300.py#L288, batch_norm is being called but data_format is not being passed as an input param to batch_norm.

If I add batch_norm to the arg_scope here: https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/nets/txtbox_300.py#L386, will the data_format be passed into the _fused_batch_norm function? Or do I need to put data_format in batch_norm_params here?
https://github.com/tumusudheer/TextBoxes-TensorFlow/blob/master/nets/txtbox_300.py#L200

Thank you
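
To make the root cause concrete, a hedged paraphrase (my own simplification, not the verbatim contrib code) of how batch_norm derives params_shape from the input, and why an NHWC default mis-sizes moving_mean on NCHW data:

import tensorflow as tf

def batch_norm_params_shape(inputs, data_format='NHWC'):
    # Channel-axis selection: NCHW takes axis 1, NHWC takes the last axis.
    # If NCHW data flows through a batch_norm that assumes NHWC, params_shape
    # picks up a spatial dimension (e.g. 19) instead of the channel count.
    shape = inputs.get_shape()
    return shape[1:2] if data_format == 'NCHW' else shape[-1:]

x = tf.placeholder(tf.float32, [32, 1024, 19, 19])  # NCHW data
print(batch_norm_params_shape(x, 'NCHW'))  # channel dim 1024 (correct)
print(batch_norm_params_shape(x, 'NHWC'))  # spatial dim 19 (the bug in this thread)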

zhangyaobit commented Oct 1, 2017

Great to see the progress! Yes, you can pass in data_format as part of batch_norm_params.
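
A minimal sketch of that fix, assuming a slim conv wrapper like the one earlier in the thread (the wrapper name and shapes are illustrative; 'fused': True is included because the NCHW path goes through the fused batch norm implementation):

import tensorflow as tf
slim = tf.contrib.slim

def conv_bn(inputs, num_outputs, kernel_size, scope, data_format='NCHW'):
    batch_norm_params = {
        'is_training': True,
        'fused': True,               # NCHW batch norm uses the fused kernel
        'data_format': data_format,  # without this, slim.batch_norm assumes NHWC
    }
    return slim.conv2d(inputs, num_outputs, kernel_size, scope=scope,
                       data_format=data_format,
                       normalizer_fn=slim.batch_norm,
                       normalizer_params=batch_norm_params)

x = tf.placeholder(tf.float32, [32, 3, 300, 300])  # NCHW input (illustrative)
net = conv_bn(x, 64, [3, 3], scope='conv1')

Alternatively, set it once for both ops with slim.arg_scope([slim.conv2d, slim.batch_norm], data_format='NCHW').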

tumusudheer commented Oct 10, 2017

Hi @zhangyaobit,

I was able to solve this issue. The main problem was that batch_norm was not getting the data_format from conv2d, so I had to pass data_format explicitly in batch_norm_params (or add slim.batch_norm to the arg_scope).

Isn't it a bug that slim.conv2d does not pass data_format to batch_norm automatically?

zhangyaobit commented Oct 12, 2017

Glad the problem is solved.

Could you double-check (with a print or pdb.set_trace at slim.conv2d) that the expected normalizer_params is passed into slim.conv2d, to help identify whether this is a usage issue or a bug in slim.conv2d?

(We first need to double-check that the function is being used correctly. If slim.conv2d is being used correctly but doesn't function as expected, then it is a bug in slim.conv2d.)

zhangyaobit commented Nov 3, 2017

Closing now. Feel free to re-open if needed.

zhangyaobit closed this Nov 3, 2017

chrisrn commented Dec 20, 2017

My resnet101 network is being trained using slim and the NCHW format, but it is slower: epoch time is 17 hours with NHWC, but with NCHW it slows down to 22 hours. That's strange because, according to the docs, batch_norm, convolution, and max pooling are faster with NCHW. Any thoughts on that?

zhangyaobit commented Dec 22, 2017

@chrisrn, I'm closing this one, as the original issue is solved. Please open a separate issue if needed.

Yeah, for resnet101, NCHW should definitely be faster.

Is there anything non-standard in your version/implementation of resnet101? Have you compared it with the one in tf_cnn_benchmarks?
https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py

chrisrn commented Dec 22, 2017

I am training on benchmarks and comparing that resnet with the slim resnet. In benchmarks, epoch time with NCHW is 15 hours. In slim, epoch time with NHWC is 17 hours, but with NCHW it is 22 hours.

zhangyaobit commented Dec 22, 2017

There is definitely a problem somewhere.

(1) Do you have the source code available, so that we can take a look?

(2) Could you run https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler on (1) slim with NHWC, (2) slim with NCHW, and (3) the tf_cnn_benchmarks with NCHW, to identify the discrepancies in the runtime breakdown?

(3) Could you double-check that fused batch norm is on? It should be on by default now, but maybe you are using an older version of TensorFlow where it is off by default. See the fused batch norm section here: https://www.tensorflow.org/performance/performance_guide
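
For reference, a one-line sketch of forcing it on explicitly in slim (fused and data_format are real slim.batch_norm arguments; the tensor shape is illustrative):

import tensorflow as tf
slim = tf.contrib.slim

x = tf.placeholder(tf.float32, [32, 64, 56, 56])  # NCHW activations (illustrative)
# Force the fused implementation explicitly; newer TF enables it by default.
y = slim.batch_norm(x, fused=True, data_format='NCHW', is_training=True, scope='bn')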

chrisrn commented Jan 16, 2018

I enabled fused batch norm and now everything is OK, which means I can run a slim model through the benchmark code and also fine-tune on ImageNet! You can close this for now.
