Using tf.data.Dataset.list_files prints "unshardable source dataset" warning #55474

Closed
dmho418 opened this issue Apr 3, 2022 · 8 comments
Labels: comp:data, comp:dist-strat, stale, stat:awaiting response, TF 2.8, type:bug

Comments


dmho418 commented Apr 3, 2022

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 LTS
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.8.0-rc1-32-g3f878cff5b6 2.8.0
  • Python version: 3.8.10

Describe the current behavior

Using a distribution strategy together with tf.data.Dataset.list_files prints an "unshardable source dataset" warning.

Describe the expected behavior
The API should detect that the source dataset is a list of files and can therefore be sharded by file.

Reference: https://www.tensorflow.org/tutorials/distribute/input

Standalone code to reproduce the issue

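import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# list_files yields file-name strings; map each one to a dummy (feature, label) pair.
ds = tf.data.Dataset.list_files(".*", shuffle=False).map(lambda x: (tf.strings.length(x), tf.strings.length(x)))
with strategy.scope():
  dummy_model = tf.keras.Sequential()
  dummy_model.add(tf.keras.layers.Dense(1, input_shape=(1,)))
  dummy_model.compile(loss="mse")
# fit distributes the dataset, which is when the warning below is logged.
dummy_model.fit(ds.batch(4))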
2022-04-03 16:22:04.058728: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 1
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\026TensorSliceDataset:350"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
      }
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_STRING
        }
      }
    }
  }
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_STRING
        }
      }
    }
  }
}
@dmho418 dmho418 added the type:bug Bug label Apr 3, 2022
@tilakrayal tilakrayal added TF 2.8 comp:dist-strat Distribution Strategy related issues comp:data tf.data related issues labels Apr 4, 2022
tilakrayal (Contributor) commented

Hello @dmho418 ,

Example mentioned in https://www.tensorflow.org/tutorials/distribute/input#sharding demonstrates how to set the sharding policy:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensors(([1.],[1.])).repeat(64).batch(16)
options = tf.data.Options()
# Request FILE sharding explicitly instead of the default AUTO policy.
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.FILE
dataset = dataset.with_options(options)

This will work as long as the dataset starts with a list of files. If the dataset doesn't start with a list of files, you will get an error along the lines of: Found an unshardable source dataset: name: "foo".
Also, please take a look at this comment by a Google developer on a similar issue. Thanks!
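
Editor's illustration: the difference between the FILE and DATA policies can be modelled in plain Python. This is only a simplified sketch of the semantics described in the tf.data documentation (FILE gives each worker a subset of the files, DATA lets each worker read everything and keep only its share of the elements). It is not TensorFlow code, and the names shard_by_file, shard_by_data, CONTENTS and FILES are hypothetical.

```python
def shard_by_file(files, num_workers, worker_index, read_file):
    """Model of FILE sharding: a worker only opens its own subset of files."""
    my_files = files[worker_index::num_workers]
    return [record for name in my_files for record in read_file(name)]


def shard_by_data(files, num_workers, worker_index, read_file):
    """Model of DATA sharding: a worker reads every file, then keeps every Nth record."""
    all_records = [record for name in files for record in read_file(name)]
    return all_records[worker_index::num_workers]


# Hypothetical in-memory "files" standing in for records on disk.
CONTENTS = {"a.txt": [1, 2], "b.txt": [3, 4], "c.txt": [5, 6], "d.txt": [7, 8]}
FILES = sorted(CONTENTS)

for worker in range(2):
    by_file = shard_by_file(FILES, 2, worker, CONTENTS.__getitem__)
    by_data = shard_by_data(FILES, 2, worker, CONTENTS.__getitem__)
    print(f"worker {worker}: FILE -> {by_file}, DATA -> {by_data}")
```

In both models every record is processed by exactly one worker. The practical difference is I/O: under the DATA model each worker still reads all of the files.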

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Apr 4, 2022

dmho418 commented Apr 5, 2022

Hi @tilakrayal

According to https://www.tensorflow.org/tutorials/distribute/input#sharding, doesn't the default AUTO policy shard by FILE automatically when a file-based dataset is detected?

I think #45157 (comment) is a different issue. There, the user is manually iterating over the dataset and passing tensors to train_on_batch, so it makes sense that TensorFlow doesn't know about the dataset.

@tilakrayal tilakrayal removed the stat:awaiting response Status - Awaiting response from author label Apr 5, 2022
sachinprasadhs (Contributor) commented

When setting AutoShardPolicy, you can choose from several sharding options, listed here.
When AutoShardPolicy is set to AUTO, it first tries to apply FILE-based sharding; if that fails, it logs a warning and falls back to DATA sharding.
The code linked below shows how each case is handled.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/data/auto_shard.cc#L763-L786
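
Editor's illustration: the fallback described above can be modelled in a few lines of plain Python. This is a simplified sketch, not the logic in auto_shard.cc. The names resolve_policy, FileShardingError, try_file_sharding and unshardable are hypothetical, and only the AUTO, FILE, DATA and OFF policies are modelled.

```python
import logging

logger = logging.getLogger(__name__)


class FileShardingError(Exception):
    """Hypothetical error raised when a pipeline cannot be sharded by file."""


def resolve_policy(policy, try_file_sharding):
    """Return the name of the sharding policy that ends up being applied.

    `policy` is one of "AUTO", "FILE", "DATA" or "OFF". `try_file_sharding`
    is a hypothetical callable that raises FileShardingError on failure.
    """
    if policy == "FILE":
        # Explicit FILE: a failure propagates as an error.
        try_file_sharding()
        return "FILE"
    if policy == "AUTO":
        try:
            try_file_sharding()
            return "FILE"
        except FileShardingError as reason:
            # AUTO: log a warning and fall back to sharding by element.
            logger.warning("Applying DATA sharding, FILE failed: %s", reason)
            return "DATA"
    # DATA and OFF are applied as requested.
    return policy


def unshardable():
    """Stand-in for a pipeline whose source cannot be sharded by file."""
    raise FileShardingError("Found an unshardable source dataset")


print(resolve_policy("AUTO", unshardable))
```

Under this model the warning in the original report is the AUTO branch taking its fallback path, while the same pipeline with an explicit FILE policy would fail outright.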

@sachinprasadhs sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Apr 20, 2022

dmho418 commented Apr 23, 2022

Thanks @sachinprasadhs, I think I misunderstood the documentation.

The distributed input tutorial says:

If you have multiple workers and are using tf.data.Dataset.list_files to create a dataset from all files matching one or more glob patterns, remember to set the seed argument or set shuffle=False so that each worker shard the file consistently.

This seems to imply that list_files creates a shardable dataset, but in reality list_files just creates a regular TensorSliceDataset, which is unshardable by FILE:

constexpr std::array<const char*, 5> kUnshardableSourceDatasetOps = {
    "GeneratorDataset",
    "RangeDataset",
    "SparseTensorsSliceDataset",
    "TensorDataset",
    "TensorSliceDataset",
};
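
Editor's illustration: if sharding by element is acceptable, one way that should avoid the warning is to request the DATA policy explicitly, so that FILE sharding is never attempted. The sketch below uses the same tf.data.Options API quoted earlier in the thread. The helper name with_data_sharding is hypothetical, and the claim that the warning disappears is an assumption based on the fallback behaviour described above, not something confirmed in this issue.

```python
def with_data_sharding(dataset):
    """Return `dataset` with the auto-shard policy set explicitly to DATA."""
    # Imported inside the function so that defining this helper does not
    # itself require TensorFlow to be installed.
    import tensorflow as tf

    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    return dataset.with_options(options)
```

With the reproduction script from the first comment, the call would look like dummy_model.fit(with_data_sharding(ds.batch(4))). Note that under DATA sharding each worker still reads the whole dataset and discards the elements that belong to other workers.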

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Apr 25, 2022
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 25, 2022
sachinprasadhs (Contributor) commented

@dmho418, if your issue is resolved, could you please close this issue? Thanks!

@sachinprasadhs sachinprasadhs added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 21, 2022
google-ml-butler commented

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 29, 2022
google-ml-butler commented

Closing as stale. Please reopen if you'd like to work on this further.
