Using tf.data.Dataset.list_files prints "unshardable source dataset" warning #55474

dmho418 · 2022-04-03T08:31:17Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 LTS
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): v2.8.0-rc1-32-g3f878cff5b6 2.8.0
Python version: 3.8.10

Describe the current behavior

Using a distribution strategy together with tf.data.Dataset.list_files prints an "unshardable source dataset" warning.

Describe the expected behavior
API should detect that the source dataset is files and can be sharded.

Reference: https://www.tensorflow.org/tutorials/distribute/input

Standalone code to reproduce the issue

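import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Build (x, y) pairs from the length of each matching file name; no reader dataset consumes the files.
ds = tf.data.Dataset.list_files(".*", shuffle=False).map(lambda x: (tf.strings.length(x), tf.strings.length(x)))
with strategy.scope():
  dummy_model = tf.keras.Sequential()
  dummy_model.add(tf.keras.layers.Dense(1, input_shape=(1,)))
  dummy_model.compile(loss="mse")
# The warning below is printed when fit() distributes the dataset.
dummy_model.fit(ds.batch(4))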

2022-04-03 16:22:04.058728: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 1
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\026TensorSliceDataset:350"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
      }
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_STRING
        }
      }
    }
  }
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_STRING
        }
      }
    }
  }
}

tilakrayal · 2022-04-04T09:09:40Z

Hello @dmho418 ,

Example mentioned in https://www.tensorflow.org/tutorials/distribute/input#sharding demonstrates how to set the sharding policy:

dataset = tf.data.Dataset.from_tensors(([1.],[1.])).repeat(64).batch(16)
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.FILE
dataset = dataset.with_options(options)

This will work as long as the dataset starts with a list of files. If it doesn't, you will see an error along the lines of Found an unshardable source dataset: name: "foo".
Also, please take a look at this comment from a Google developer on a similar issue. Thanks!

dmho418 · 2022-04-05T01:23:50Z

Hi @tilakrayal

From https://www.tensorflow.org/tutorials/distribute/input#sharding doesn't that mean that the default policy AUTO would shard by FILE automatically if a file-based dataset is detected?

I think #45157 (comment) describes a different issue. There the user is manually iterating over the dataset and passing tensors to train_on_batch, so it makes sense that TensorFlow doesn't know about the dataset.

sachinprasadhs · 2022-04-20T18:42:14Z

AutoShardPolicy offers several options for how sharding is performed; they are listed here.
When AutoShardPolicy is set to AUTO, TensorFlow first tries FILE-based sharding; if that fails, it logs a warning and falls back to DATA sharding.
The code linked below shows how each case is handled.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/data/auto_shard.cc#L763-L786
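
For a pipeline with no file-based source, one option is to request DATA sharding explicitly so that AUTO never attempts FILE sharding. The snippet below is a minimal sketch, not an official recommendation: the toy `Dataset.range` pipeline is made up for illustration, and the assumption that selecting DATA directly avoids the warning rests on the fallback logic linked above.

```python
import tensorflow as tf

# A small in-memory dataset (illustrative only); it has no file-based source.
dataset = tf.data.Dataset.range(8).batch(2)

options = tf.data.Options()
# Ask for DATA sharding directly instead of letting AUTO try FILE first.
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = dataset.with_options(options)

# Read the policy back from the dataset to confirm the option was attached.
policy = dataset.options().experimental_distribute.auto_shard_policy
```

With DATA sharding every worker still iterates over the whole dataset and keeps only its own share of the elements, so this changes which policy is applied, not how much is read.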

dmho418 · 2022-04-23T01:43:02Z

Thanks @sachinprasadhs, I think I misunderstood the documentation.

On the distributed input tutorial it says:

If you have multiple workers and are using tf.data.Dataset.list_files to create a dataset from all files matching one or more glob patterns, remember to set the seed argument or set shuffle=False so that each worker shard the file consistently.

That seems to imply that list_files creates a shardable dataset, but in reality list_files just creates a regular TensorSliceDataset, which is on the list of source datasets that cannot be sharded by FILE:

tensorflow/tensorflow/core/grappler/optimizers/data/auto_shard.cc

Lines 132 to 138 in e6ba479

    constexpr std::array<const char*, 5> kUnshardableSourceDatasetOps = {
        "GeneratorDataset",
        "RangeDataset",
        "SparseTensorsSliceDataset",
        "TensorDataset",
        "TensorSliceDataset",
    };
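
If sharding at the file level is what you want, one way to get it without relying on the AUTO policy is to shard the file names yourself and then feed them to a reader dataset. The sketch below makes several assumptions: the temporary text files, the `part-*.txt` pattern, and the `make_dataset` helper are all hypothetical, and `tf.data.TextLineDataset` stands in for whatever reader fits the real data. It shows manual sharding with `Dataset.shard`, not what the auto-shard pass does internally.

```python
import os
import tempfile

import tensorflow as tf

# Create a few small text files so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
for i in range(4):
    with open(os.path.join(tmp_dir, f"part-{i}.txt"), "w") as f:
        f.write(f"file{i}-line0\nfile{i}-line1\n")

pattern = os.path.join(tmp_dir, "part-*.txt")


def make_dataset(num_shards, index):
    """Hypothetical helper: builds the pipeline for one worker."""
    # shuffle=False keeps the file order identical on every worker,
    # so the shards do not overlap.
    files = tf.data.Dataset.list_files(pattern, shuffle=False)
    # Shard at the file level: this worker keeps every num_shards-th file.
    files = files.shard(num_shards, index)
    # A reader dataset turns each file name into the lines of that file.
    return files.interleave(tf.data.TextLineDataset, cycle_length=2)


# Simulate two workers and collect the lines each one would read.
shard0 = [x.decode() for x in make_dataset(2, 0).as_numpy_iterator()]
shard1 = [x.decode() for x in make_dataset(2, 1).as_numpy_iterator()]
```

Under a distribution strategy, the shard count and index would normally come from the `tf.distribute.InputContext` passed to the function given to `strategy.distribute_datasets_from_function`, rather than being hard-coded as here.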

sachinprasadhs · 2022-09-21T23:48:54Z

@dmho418, if your issue is resolved, could you please close it? Thanks!

google-ml-butler · 2022-09-29T00:33:47Z

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler · 2022-10-06T00:49:27Z

Closing as stale. Please reopen if you'd like to work on this further.

dmho418 added the type:bug Bug label Apr 3, 2022

google-ml-butler bot assigned tilakrayal Apr 3, 2022

tilakrayal added TF 2.8 comp:dist-strat Distribution Strategy related issues comp:data tf.data related issues labels Apr 4, 2022

tilakrayal added the stat:awaiting response Status - Awaiting response from author label Apr 4, 2022

tilakrayal removed the stat:awaiting response Status - Awaiting response from author label Apr 5, 2022

tilakrayal assigned sachinprasadhs and unassigned tilakrayal Apr 5, 2022

sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Apr 20, 2022

tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Apr 25, 2022

sachinprasadhs assigned aaudiber Apr 25, 2022

sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 25, 2022

sachinprasadhs added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 21, 2022

google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 29, 2022

google-ml-butler bot closed this as completed Oct 6, 2022
