
Using Dataset api with Estimator in MirroredStrategy, Non-DMA-safe string tensor error #19588

Closed
huangynn opened this issue May 28, 2018 · 8 comments

@huangynn

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS
  • TensorFlow installed from (source or binary): pip install tensorflow-gpu
  • TensorFlow version (use command below): 1.8.0
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source): 4.8.5
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: GeForce GTX 1080Ti * 4
  • Exact command to reproduce:

Describe the problem

Using multiple GPUs via MirroredStrategy, I get a 'Non-DMA-safe string tensor may not be copied from/to a GPU.' error.

Source code / logs

[screenshots]

@chengmengli06

I'm hitting the same problem @skye when using the object detection APIs.

@skye
Member

skye commented Jun 4, 2018

Can you provide code to repro the problem?

@skye skye assigned mrry and unassigned skye Jun 4, 2018
@mrry mrry assigned rohan100jain and guptapriya and unassigned mrry Jun 4, 2018
@mrry
Contributor

mrry commented Jun 4, 2018

Rohan/Priya: I'm guessing this is what happens when a tf.string tensor goes through prefetch_to_devices(), but unclear whether it should be handled in the client program (e.g. by splitting out strings from the prefetched dataset) or in the FunctionBufferingResource (e.g. by allowing some outputs to be "host memory" only).
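One version of the client-side option mrry mentions (keeping tf.string out of the prefetched dataset) is to re-encode each string as a DMA-safe uint8 tensor inside the pipeline and decode it back on the host when needed. A minimal sketch, using the current tf.io.decode_raw API (TF 1.8 code would use tf.decode_raw) and made-up dataset contents:

```python
import tensorflow as tf

# Re-encode each tf.string element as a uint8 tensor so nothing of dtype
# string reaches a device-prefetch stage. The strings here are illustrative.
ds = tf.data.Dataset.from_tensor_slices(["alpha", "beta"])
ds = ds.map(lambda s: tf.io.decode_raw(s, tf.uint8))  # bytes as uint8

# Back on the host, the original strings are recoverable:
decoded = [bytes(v).decode() for v in ds.as_numpy_iterator()]
print(decoded)  # ['alpha', 'beta']
```

Note that variable-length strings produce ragged uint8 vectors, so this only batches cleanly with padding.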

@huangynn
Author

huangynn commented Jun 5, 2018

```python
def get_inputs(mode, csv_file, batch_size, label_list, preprocess):
    iterator_initializer_hook = IteratorInitializerHook()

    def inputs():
        is_training = mode == estimator.ModeKeys.TRAIN
        ds = tf.data.TextLineDataset(csv_file).skip(1)

        def classification_parse_line(line):
            columns = ['img', 'label']
            img_name, label = tf.decode_csv(
                line,
                record_defaults=[[''], ['']])
            # assume every pic is rgb
            image_decoded = tf.image.decode_png(
                tf.read_file(img_name),
                channels=3)
            image = preprocess(image_decoded)
            """image = image_preprocessing_fn(
                image,
                image_size,
                image_size)"""
            return image, label

        cpu_num = multiprocessing.cpu_count()
        ds = ds.apply(
            tf.contrib.data.map_and_batch(
                classification_parse_line,
                batch_size=batch_size,
                num_parallel_batches=cpu_num))
        ds = ds.prefetch(None)
        iterator = ds.make_initializable_iterator()

        iterator_initializer_hook.iterator_initializer_func = \
            lambda sess: sess.run(iterator.initializer)
        return ds

    return iterator_initializer_hook, inputs


distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(
    model_dir=args.model_dir,
    tf_random_seed=912,
    save_summary_steps=args.save_summary_steps,
    save_checkpoints_steps=args.save_interval_steps,
    keep_checkpoint_max=5 * get_num_replicas(),
    train_distribute=distribution,
    session_config=session_config)

classifier = tf.estimator.Estimator(
    my_model,
    config=config,
    params=params)

for epoch in range(args.num_epochs):
    logger.info('Starting epoch %d / %d' % (epoch + 1, args.num_epochs))
    classifier.train(
        train_ds,
        hooks=[train_ds_hook])
    classifier.evaluate(
        val_ds,
        hooks=[val_ds_hook])
```

@huangynn
Author

huangynn commented Jun 5, 2018

Nothing special, it's just common code with MirroredStrategy.
If I replace MirroredStrategy with OneDeviceStrategy, everything works fine.
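For reference, the working single-device fallback can be sketched as follows. This uses the modern tf.distribute.OneDeviceStrategy; at the time of this issue the class lived under tf.contrib.distribute, and the training values here are made up:

```python
import tensorflow as tf

# Pin variables and computation to one device instead of mirroring: with a
# single device there is no cross-device copy, so string tensors never hit
# the DMA path that MirroredStrategy trips over.
strategy = tf.distribute.OneDeviceStrategy(device="/CPU:0")

with strategy.scope():
    v = tf.Variable(1.0)

@tf.function
def step(x):
    # With one device, strategy.run returns the per-replica result directly.
    return strategy.run(lambda t: t + v, args=(x,))

print(float(step(tf.constant(2.0))))  # 3.0
```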

@tensorflowbutler
Member

Nagging Assignees @rohan100jain, @guptapriya: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@rohan100jain rohan100jain added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug labels Jun 19, 2018
@rohan100jain
Member

As Derek mentioned, currently we don't have a mechanism for specifying if some outputs should be in host memory and we assume (to a large extent) that they'd be in device memory. Strings can't be in device memory, hence the bug. I shall work on having a dynamic method of identifying which outputs should be allocated on the host / device. Stay tuned for a fix in a bit.

case540 pushed a commit to case540/tensorflow that referenced this issue Jun 27, 2018
FunctionBufferingResource. This allows for types such as strings that are
always in host memory to be returned from the FunctionBufferingResource.

Fixes tensorflow#19588

PiperOrigin-RevId: 202206052
@guptapriya
Contributor

@rohan100jain can this issue be closed now that your fix has been merged?
