Dataset API 'flat_map' method producing error for same code which works with 'map' method #17415

Closed
sibyjackgrove opened this issue Mar 4, 2018 · 8 comments

Comments

@sibyjackgrove
Contributor

sibyjackgrove commented Mar 4, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Custom code

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10

  • TensorFlow installed from (source or binary): Binary

  • TensorFlow version (use command below): 1.6.0

  • Python version: 3.5

  • Bazel version (if compiling from source):

  • GCC/Compiler version (if compiling from source):

  • CUDA/cuDNN version: 9.0/7.0

  • GPU model and memory: GeForce GTX 860M

  • Exact command to reproduce: dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset, [file_name], tf.float64))

Describe the problem

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the 'flat_map' method produces an error, while the same code with the 'map' method builds and runs in a session. This is the code I am using.

Source code / logs

import os
import pandas as pd
import tensorflow as tf

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data_for_dataset(file_name, rows=100):
    # tf.py_func passes the file name as bytes, so decode it first.
    print(file_name.decode())

    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    # Note: astype returns a new array; the result is discarded here.
    X_data.astype('float32', copy=False)

    return X_data

dataset = tf.data.Dataset.from_tensor_slices(file_names)
# This is the line that raises "map_func must return a Dataset object".
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset, [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()

I get the following error: map_func must return a Dataset object. It would also be great if you could provide documentation on using the Dataset API with the Pandas module.

@carlthome
Contributor

So if your pipeline works with map why not just use map?

The function passed to flat_map should have the return type tf.data.Dataset by definition. This is not an error (this can be a fun watch if you want to learn why).

@sibyjackgrove
Contributor Author

Yes, the pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size. map adds another axis instead of concatenating along the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So why does my code work with map and not with flat_map?
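To make the two shapes concrete, here is a rough plain-Python model of the difference, using nested lists instead of tensors. It is only an illustration of the semantics described above, not TensorFlow code; the helpers batch and shape are hypothetical names introduced for this sketch.

```python
from itertools import chain

def batch(items, size):
    # Group consecutive items into lists of `size` (hypothetical helper).
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]

def shape(x):
    # Report the nesting depth and lengths of a regular nested list.
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

N, B = 3, 2
# Each "file" is N rows of 2 columns.
files = [[[f, r] for r in range(N)] for f in range(B)]

# map-like behaviour: one element per file, so batching stacks whole files.
map_shape = shape(batch(files, B)[0])

# flat_map-like behaviour: each file is expanded into its rows first,
# so batching stacks individual rows.
flat_shape = shape(batch(chain.from_iterable(files), N * B)[0])

print(map_shape, flat_shape)
```

With N = 3 and B = 2, the map-style batch has shape (B, N, 2) and the flattened one has shape (N*B, 2), which matches the two results being compared here.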

@carlthome
Contributor

carlthome commented Mar 5, 2018

There's a difference between what map and flat_map return, and what the function you pass to them is supposed to return. I suppose this could be clarified in their docstrings though. Maybe make a PR for this?

@sibyjackgrove
Contributor Author

So the function passed to map doesn't need to return a dataset object, while the one passed to flat_map does? It would be great if this were clarified in the documentation. Could you suggest what I need to change in the code for my function to return a dataset?

@carlthome
Contributor

Sorry, this discussion should be had at https://stackoverflow.com/questions/tagged/tensorflow instead. Could you make a question there?

@sibyjackgrove
Contributor Author

sibyjackgrove commented Mar 5, 2018

@sibyjackgrove
Contributor Author

The solution suggested on Stack Overflow was to convert the output of my py_function to a dataset. So I modified my py_function as shown below.

  
  X_data.astype('float32', copy=False)
  d = tf.data.Dataset.from_tensors(X_data)
  return d

However, I am still getting the 'map_func must return a Dataset object' error. It would be great if somebody could clarify whether this is a bug or a problem with my code.
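A likely reason the change above does not help: the Dataset is built inside the Python function, but the function is still wrapped in tf.py_func, which hands back tensors, so flat_map still receives a tensor. One possible way around it is to keep the Python function returning an array, load it with map, and then let flat_map split each loaded array into rows. The sketch below assumes a TensorFlow 2.x install, where tf.numpy_function and eager iteration are available; on the 1.6 release discussed in this thread the corresponding calls would be tf.py_func and a one-shot iterator. load_rows and load_as_tensor are hypothetical stand-ins for the pandas loader, and the file names are made up.

```python
import numpy as np
import tensorflow as tf

ROWS, COLS = 3, 2

def load_rows(file_name):
    # Hypothetical stand-in for pd.read_csv: returns a ROWS x COLS array
    # filled with the length of the file name, so files are distinguishable.
    return np.full((ROWS, COLS), float(len(file_name)), dtype=np.float64)

def load_as_tensor(file_name):
    # The wrapped Python function returns a tensor, not a Dataset.
    rows = tf.numpy_function(load_rows, [file_name], tf.float64)
    # The static shape is lost across the Python call; restore it so the
    # tensor can be sliced along its first axis.
    rows.set_shape([None, COLS])
    return rows

file_names = ["a.csv", "bb.csv"]  # assumed names, for illustration only
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.map(load_as_tensor)
# Here the function passed to flat_map does return a Dataset: one element per row.
dataset = dataset.flat_map(tf.data.Dataset.from_tensor_slices)
dataset = dataset.batch(ROWS * len(file_names))

batches = [b.numpy() for b in dataset]
print(batches[0].shape)
```

In this sketch each batch stacks rows from consecutive files along the first axis, rather than adding a new leading axis per file.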

@mrry
Contributor

mrry commented Mar 6, 2018

I posted an answer on Stack Overflow.

@mrry mrry closed this as completed Mar 6, 2018